Re: [Wikitech-l] Jake requests enabling access and edit access to Wikipedia via TOR
On Tue, Dec 31, 2013 at 1:08 AM, Martijn Hoekstra martijnhoeks...@gmail.com wrote: Does Jake have any mechanism in mind to prevent abuse? Is there any possible mechanism available to prevent abuse?

Preventing abuse is the wrong goal. There is plenty of abuse even with all the privacy-smashing, new-editor-deterring convolutions that we can think up. Abuse is part of the cost of doing business of operating a publicly editable wiki; it's a cost which is normally well worth its benefits. The goal merely needs to be to limit the abuse enough so as not to upset the abuse-vs-benefit equation.

Today, people abuse, they get blocked, they go to another library/coffee shop/find another proxy/wash, rinse, repeat. We can't do any better than that model, and it turns out that it's okay. If a solution for Tor users results in a cost (time, money, whatever unit of payment is being expended) for repeated abuse comparable to the other ways abusive people access the site, then it should not be a major source of trouble which outweighs the benefits. (Even if you do not value freedom of expression and association for people in less free parts of the world at all.)
Re: [Wikitech-l] Jake requests enabling access and edit access to Wikipedia via TOR
On Sun, Jan 12, 2014 at 6:36 PM, Jasper Deng jas...@jasperswebsite.com wrote: This question is analogous to the question of open proxies. The answer has universally been that the costs (abuse) are just too high.

No, it's not analogous to just permitting open proxies, as no one in this thread is suggesting just flipping it on. I proposed issuing blind exemption tokens up-thread as an example mechanism which would preserve the rate limiting of abusive use without removing privacy.

However, we might consider doing what the freenode IRC network does. Freenode requires SASL authentication to connect on Tor, which basically means only users with registered accounts can use it. The main reason for hardblocking and not allowing registered accounts on-wiki via Tor is that CheckUsers need useful IP data. But it might be feasible if we just force all account creation to happen on real IPs, although that still hides some data from CheckUsers.

What freenode does is not functionally useful for Tor users. In my first-hand experience it manages to enable abusive activity while simultaneously eliminating Tor's usefulness at protecting its users. The only value it provides is a pretext of Tor support without actually doing something good... and we already have the "you can get an IPblock-exempt" route (except you can't really, and if you do it'll get randomly revoked) if all we want is a pretext. :)
Re: [Wikitech-l] Jake requests enabling access and edit access to Wikipedia via TOR
On Mon, Dec 30, 2013 at 6:10 PM, Tyler Romeo tylerro...@gmail.com wrote: On Mon, Dec 30, 2013 at 7:34 PM, Chris Steipp cste...@wikimedia.org wrote: I was talking with Tom Lowenthal, who is a tor developer. He was trying to convince Tilman and I that IP's were just a form of collateral that we implicitly hold for anonymous editors. If they edit badly, we take away the right of that IP to edit, so they have to expend some effort to get a new one. Tor makes that impossible for us, so one of his ideas is that we shift to some other form of collateral-- an email address, mobile phone number, etc. Tilman wasn't convinced, but I think I'm mostly there. This is a viable idea. Email addresses are a viable option considering they take just as much (if not a little bit more) effort to change over as IP addresses. We can take it even a step further and only allow email addresses from specific domains, i.e., we can restrict providers of so-called throwaway emails.

Email is pretty shallow collateral, especially if you actually allow email providers which are materially useful to people who are trying to protect their privacy. Allowing only email providers which require SMS binding, for example, would be pretty terrible... This is doubly so because the relationship is discoverable: you only really wanted to use the email to provide scarcity, but because it was provided it could be used to deanonymize the users. (Even if you intentionally didn't log the email-user mapping, it would end up being deanonymized-by-time in database backups; or it could be secretly logged at any time, e.g. via compromised staff.)

FAR better than this can be done without much more work. Digging up an old proposal of mine…

A proposal for more equitable access to ipblock-exempt. In the Jake requests enabling access and edit access to Wikipedia via TOR thread on wikitech-l [http://lists.wikimedia.org/pipermail/wikitech-l/2013-December/073764.html] the issue of being able to edit Wikipedia via TOR was highlighted. Some people appear to have mistaken this thread as being specifically about Jake. This isn't so— Jake is technologically sophisticated and has access to many technical and social resources. Jake-the-person can edit Wikipedia, with suitable effort. But Jake-as-a-proxy-for-other-tor-users has a much harder time. Ipblock-exempt as implemented today doesn't— as demonstrated [http://lists.wikimedia.org/pipermail/wikitech-l/2013-December/073773.html] —even work for Jake. It certainly doesn't work for more typical users.

Many people believe that Wikipedia has become so socially important that being able to edit it— even if just to leave talk page comments— is an essential part of participating in worldwide society. Unfortunately, not all people are equally free, and some can only access Wikipedia via anti-censorship technology or can only speak without fear of retaliation via anonymity technology. Wikipedia must balance the interests of preventing abuse and enabling the sharing of knowledge. Only so much can be accomplished by prohibiting access to Tor entirely: miscreants can and do use paid VPNs and compromised hosts to evade blocks on a constant basis. Ironically, abusive users who are unconcerned about breaking the law have an easier time editing Wikipedia than people simply concerned about unlawful surveillance. That isn't a balance.
In order to better balance these interests, I propose the following technical improvement: a new special page should be added with a form which takes an unblocked username and which accepts a base64-encoded message containing a random serial number and an RSA digital signature made with a well-known, Wikimedia-controlled private key; we'll call this message an exemption token. If the signature passes and the serial number has never been seen before, the serial number is saved and Ipblock-exempt is set on the account.

Additionally, the online donation process is updated with some client-side JS so that for every $10 donated the client picks a random value, cryptographically blinds the random value [https://en.wikipedia.org/wiki/Blind_signature#Blind_RSA_signatures.5B2.5D:235], and submits the blinded values along with the donation. When the donation is successful, the donation server signs the blinded values and returns them, and the client unblinds them and presents the resulting messages to the user. [RSA blinding is no more complicated to implement than RSA signing in general. It requires a modular exponentiation, a multiplication, and a modular inversion.] The donor is free to save the messages, give them out to friends, or press some button to give them to the tor project. Each message entitles one account to be exempted, and Wikimedia is unable to associate donations with accounts due to the blinding.

Finally, the block notice should direct people to a page with instructions on obtaining exemption tokens. This process would provide a guaranteed bound on the amount of abusive use of ipblock-exempt.
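To make the mechanics concrete, here is a minimal sketch of the blinding, signing, unblinding, and redemption steps (Python, toy-sized key, no padding; purely illustrative, not a statement of how a production system would be built, which would at minimum hash the serial and use full-size keys):

    # Sketch of RSA blind signatures for exemption tokens. Toy parameters only.
    import secrets
    from math import gcd

    # Hypothetical Wikimedia-controlled key: n and e are public, d is held only
    # by the donation server.
    p, q = 104729, 1299709                  # toy primes -- far too small to be secure
    n = p * q
    e = 65537
    d = pow(e, -1, (p - 1) * (q - 1))       # modular inverse; needs Python 3.8+

    # Donor's client: pick a random serial and blind it before donating.
    serial = secrets.randbelow(n)
    while True:
        r = secrets.randbelow(n - 2) + 2    # blinding factor, must be invertible mod n
        if gcd(r, n) == 1:
            break
    blinded = (serial * pow(r, e, n)) % n   # all the donation server ever sees

    # Donation server: sign the blinded value; it learns nothing about `serial`.
    blind_sig = pow(blinded, d, n)

    # Donor's client: unblind to obtain an ordinary RSA signature on the serial.
    sig = (blind_sig * pow(r, -1, n)) % n

    # The new special page: verify the token, check the serial is unused, then
    # set Ipblock-exempt on the supplied account.
    assert pow(sig, e, n) == serial

The unblinding works because blind_sig = (serial * r^e)^d = serial^d * r (mod n), so multiplying by the inverse of r leaves an ordinary signature on a serial the signer never saw, which is exactly what makes the donation unlinkable to the account.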
Re: [Wikitech-l] Live stream from Wikimania 2010 about MediaWiki
On Sun, Jul 11, 2010 at 5:42 AM, Siebrand Mazeland s.mazel...@xs4all.nl wrote: Hi, Just to inform you about the NOW running live streams from Wikimania about MediaWiki. See http://toolserver.org/~reedy/wikimania2010/jazzhall.html Runs until 13.00 CEST TODAY/NOW! Shame. This requires some plugin stuff.
Re: [Wikitech-l] Live stream from Wikimania 2010 about MediaWiki
On Sun, Jul 11, 2010 at 6:29 AM, Erik Moeller e...@wikimedia.org wrote: I am hugely grateful that we have reliable streaming this year, thanks to a lot of volunteer effort. Perhaps we can defer the ideological nitpicking and just share that appreciation. I would be grateful even if it required a Windows-only plugin, which Flash is not.

I've been working from a non-x86 system the past couple of days. Even if I wanted to install the proprietary flash software I couldn't. The time-delayed uploaded files worked pretty well last year, and I was able to watch all the presentations I was interested in. This year I wasn't able to watch a single one.

This isn't merely ideology. But even if it were, ideology doesn't mean without practical value; ideology can often mean preferring a strategy believed to be practically superior over the long term in preference to some short-term expedience. I presume you pursue long-term winning strategies over the best immediate gain constantly through your life and don't consider these decisions to be ideological, much less nitpicking.

I suppose it is valuable information to know that you have so little respect for my opinions, though I would have preferred to learn of this someplace other than on a public mailing list.
Re: [Wikitech-l] Reject button for Pending Changes
On Sun, Jun 27, 2010 at 2:48 PM, Rob Lanphier ro...@wikimedia.org wrote: [snip] look at the revision history. However, this should be reasonably rare, and the diff remains in the edit history to be rescued, and can be reapplied if need be. A competing problem is that disabling the reject button will

Do you have any data to support your rarity claim beyond the fact that reviews spanning multiple revisions are themselves rare to the point of non-existence on enwp currently? Why is rarity a good criterion for increasing the incidence of blind reversion of good edits?

An informal argument here is that many contributors will tell you that if their initial honest contributions to Wikipedia had been instantly reverted they would not have continued editing— and so extreme caution should be taken in encouraging blind reversion unless it is urgently necessary. Current review delays on enwp are very short; what is the urgency for requiring a mechanism for _faster_ reversions of edits which are not being displayed to the general public?

Could the goal be equally well served by removing the unapprove button from the review screen, where it is confusingly juxtaposed with the approve button, and instead displaying it on the edit history next to the text indicating which revisions have the reviewed state?
[Wikitech-l] Broken validation statistics
Is anyone working on fixing the broken output from http://en.wikipedia.org/wiki/Special:ValidationStatistics ? I brought this up on IRC a week-ish ago and there was some speculation as to the cause but it wasn't clear to me if anyone was working on fixing it.
Re: [Wikitech-l] Reject button for Pending Changes
On Sun, Jun 27, 2010 at 6:04 PM, Rob Lanphier ro...@robla.net wrote: On Sun, Jun 27, 2010 at 12:12 PM, Gregory Maxwell gmaxw...@gmail.com wrote: On Sun, Jun 27, 2010 at 2:48 PM, Rob Lanphier ro...@wikimedia.org wrote: [snip] look at the revision history. However, this should be reasonably rare, and the diff remains in the edit history to be rescued, and can be reapplied if need be. A competing problem is that disabling the reject button will Do you have any data to support your rarity claim beyond the fact that reviews spanning multiple revisions are themselves rare to the point of non-existence on enwp currently? I don't have that data. However, let me put it another way. We have a known problem (many people confused/frustrated by the lack of an enabled reject button), which we're weighing against a theoretical and currently unquantified problem (the possibility that an intermediate pending revision should be accepted before a later pending revision is rejected). I don't think it's smart for us to needlessly disable this button in the absence of evidence showing that it should be disabled.

I think you've failed to actually demonstrate a known problem here. The juxtaposition of the approve and unapprove buttons can be confusing, I agree. In most of the discussions where it has come up, people appear to have left satisfied once it was explained to them that 'rejecting' wasn't a tool limited to reviewers— that everyone can do it using the same tools that they've always used. Or, in other words, shortcomings in the current interface design have made it difficult for someone to figure out what actions are available to them, and not that they actually have any need for more potent tools to remove contributions from the site.

I think it's important to note that reverting revisions is a regular editorial task that we've always had, which pending changes has almost no interaction with. If there is a need for a one-click multi-contributor multi-contribution bulk revert, why has it not previously been implemented? Moreover, you've selectively linked one of several discussions — when in others it was made quite clear that many people (myself included, of course) consider a super-rollback "undo everything pending" button to be highly undesirable. Again— I must ask where the evidence is that we are in need of tools to increase the _speed_ of reversion actions on pages with pending changes at the expense of the quality of those determinations. Feel free to point out if you don't actually believe a bulk revert button would be such a trade-off.

The current spec doesn't call for blind reversion. It has a confirmation screen that lists the revisions being reverted.

I don't think it's meaningful to say that a revert wasn't blind simply because the reverting user was exposed to a list of user names, edit summaries, and timestamps (particularly without immediate access to the diffs). A blind revert is a revert which is made without evaluating the content of the change. Such reverts are possible through the rollback button, for example, but rollback is limited to contiguous edits by a single contributor. Blind reverts can also be done by selecting an old version and saving it, but that takes several steps and the software cautions you about doing it. The removal of rollback privileges due to excessively sloppy use is a somewhat frequent event, and the proposed change to the software is even more risky.
These bulk tools also remove the ability to provide an individual explanation for the removal of each of the independent changes.

I think making accept/unaccept into a single toggling button is the right thing to do.

Because of page load times, by the time I get the review screen up someone has often approved the revision. If I am not maximally attentive, will I now accidentally unapprove a fine version of the page simply because the button I normally click has reversed its meaning? This doesn't seem especially friendly to me. Or: "A user interface is well-designed when the program behaves exactly how the user thought it would", and this won't.

Furthermore, because of the potentially confusing result of unaccepting something, I'd even recommend only making it possible when looking at the diff between the penultimate accepted revision and the latest accepted revision, which is documented in this request: http://www.pivotaltracker.com/story/show/3949176

That sounds good to me. Though the review screen which you'd visit with the intent of reviewing a change fits that description, and if you change the meaning of a commonly used button it will result in errors of the form I just raised.

However, I don't think that removes the need for a reject button, for reasons I outline here: http://flaggedrevs.labs.wikimedia.org/wiki/Wikimedia_talk:Reject_Pending_Revision

At the DC meetup yesterday someone used the explanation Pending changes is an approval of a particular
Re: [Wikitech-l] Reject button for Pending Changes
On Sun, Jun 27, 2010 at 9:59 PM, Gregory Maxwell gmaxw...@gmail.com wrote: Moreover, you've selectively linked one of several discussions — when in others it was made quite clear that many people (myself included, of course) consider a super-rollback undo everything pending button to be highly undesirable. Someone asked me off list to provide an example, so here is one: http://en.wikipedia.org/wiki/Wikipedia_talk:Reviewing#What_gets_flagged_and_what_does_not
[Wikitech-l] Problem with the pending changes review screen.
Imagine an article with many revisions and pending changes enabled: A, B, C, D, E, F, G... A is an approved edit. B, C, D, E, F, G are all pending edits. B is horrible vandalism that the subsequent edits did not fix. You are a reviewer; you go to the review page by clicking a pending review link. On the review page you can accept— thus putting the horrible vandalism on the site. Or you can reject, which throws out all the good edits of C, D, E, F, G by reverting to A.

To quote someone from IRC: "this seems like it's going to make vandals even more effective because all they have to do is make one edit in a string of ten good ones, and then the entire set has to be thrown out". But that isn't true at all. You're not confined to the review page: you simply go to the edit history, click undo on B, and then approve your own edit (it won't be auto-approved because G wasn't approved). Tada.

This is completely non-obvious to people, because the only options on the review page are accept or reject, and it's already causing confusion. This is a direct result of the late-in-the-process addition of the review button — trying to fit the round peg of a revision-reviewing system (which we can't have because of the fundamental incompatibility with a single linear editing history) into the square hole of the presentation-flagging system that we actually have.

I don't know how to fix this. We could remove the reject button to make it more clear that you use the normal editing functions (with their full power) to reject. But I must admit that the easy rollback button is handy there. Alternatively we could put a small chunk of the edit history on the review page, showing the individual edits which comprise the span-diff (bonus points for color-coding if someone wants to make a real programming project out of it) along with the undo links and such. In the meantime I expect enwp will edit the message text to direct people to the history page for more sophisticated editing activities.

(Thanks to Risker for pointing out how surprising the pending review page was for this activity)
Re: [Wikitech-l] Problem with the pending changes review screen.
On Tue, Jun 15, 2010 at 11:05 PM, Gregory Maxwell gmaxw...@gmail.com wrote: Imagine an article with many revisions and pending changes enabled: A, B, C, D, E, F, G... [snip] I don't know how to fix this. We could remove the reject button to make it more clear that you use the normal editing functions (with their full power) to reject. But I must admit that the easy rollback button is handy there. Alternatively we could put a small chunk of the edit history on the review page, showing the individual edits which comprise the span-diff (bonus points for color-coding if someone wants to make a real programming project out of it) along with the undo links and such. [snip]

Further discussion with Risker has caused me to realize that there is another significant problem situation with the reject button. Consider the following edit sequence: A, B, C, D, E. A is a previously approved version. B and D are both excellent edits. C and E are obvious vandalism. E even managed to undo all the good changes of B and D while adding the vandalism.

A reviewer hits the pending revisions link in order to review; they get the span diff from A to E. All they see is vandalism; there is no indication of the redeeming edits in the intervening span. So they hit reject. The good edits are lost.

Unlike the prior problem, the only way to solve this would be to only display the REJECT button if all of the pending changes are by the same author (or to limit it to only one pending change in the span, which would be slightly more conservative, but considering the behaviour of the rollback button I think the group-by-author behaviour would be fine). The accept button is still safe.
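A hypothetical sketch of the gating rule being suggested (the revision structure and names below are invented for illustration, not the actual FlaggedRevs schema):

    # Only offer the one-click REJECT when reverting the whole pending span cannot
    # silently discard another contributor's work. Illustrative sketch only.
    from dataclasses import dataclass

    @dataclass
    class Revision:
        rev_id: int
        author: str
        approved: bool

    def show_reject_button(history):
        pending = [r for r in history if not r.approved]
        if not pending:
            return False
        # Group-by-author rule: safe only when every pending edit is by one author,
        # mirroring what rollback already permits for contiguous edits.
        return len({r.author for r in pending}) == 1

    history = [
        Revision(1, "Alice", True),     # A: previously approved
        Revision(2, "Bob", False),      # B: good pending edit
        Revision(3, "Mallory", False),  # C: pending vandalism
    ]
    print(show_reject_button(history))  # False: rejecting would discard Bob's edit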
Re: [Wikitech-l] Problem with the pending changes review screen.
On Tue, Jun 15, 2010 at 11:38 PM, Carl (CBM) cbm.wikipe...@gmail.com wrote: On Tue, Jun 15, 2010 at 11:30 PM, Gregory Maxwell gmaxw...@gmail.com wrote: Consider the following edit sequence: A, B, C, D, E A is a previously approved version. B, and D are all excellent edits. C and E are obvious vandalism. E even managed to undo all the good changes of B,D while adding the vandalism. The only way to handle this sort of thing is to actually look at the intermediate edits. I don't know if there is a nice way to simplify that workflow, but it points me towards the idea that reviewing should be done off the history page, not directly off a list of unreviewed pages. This is how the software worked until recently. :( I feel foolish for not catching this until now even though I was aware of the addition of the reject button. Sorry.
Re: [Wikitech-l] Updating strings for FlaggedRevs for the Flagged Protection/Pending Revisions/Double Check launch
On Sat, May 22, 2010 at 2:13 PM, Rob Lanphier ro...@wikimedia.org wrote: Hi everyone, I'm preparing a patch against FlaggedRevs which includes changes that Howie and I worked on in preparation for the launch of its deployment onto en.wikipedia.org. We started first by creating a style guide describing how the names should be presented in the UI: http://en.wikipedia.org/wiki/Wikipedia:Flagged_protection_and_patrolled_revisions/Terminology [snip]

I'm concerned that the simplified graphical explanation of the process fosters the kind of misunderstanding that we saw in the first slashdot threads about flagged revisions... particularly the mistaken belief that the process is synchronous. People outside of the active editing community have frequently raised the same concerns upon their exposure to the idea of flagged revisions. Common ones I've seen: "Won't people simply reject changes so they can make their own edits?" "Who is going to bother to merge all the unreviewed changes on a busy article? They're going to lose a lot of contributions!"

None of these concerns really apply to the actual implementation, because it's the default display of the articles which is controlled, not the ability to edit. There is still a single chain of history, and the decision to display an article happens totally asynchronously with the editing. The illustration still fosters the notion of some overseeing gatekeeper on an article expressing editorial control— which is not the expected behaviour of the system, nor a desired behaviour, nor something we would even have the resources to do if it were desirable. In particular there is no per-revision analysis mandated by our system: many edits will happen, then someone with the right permissions will look at a delta from then-to-now and decide that nothing is terrible in the current version and make it the displayed version. It's possible that there were terrible intermediate versions, but it's not relevant.

I have created a poster suitable for distribution to journalists: http://myrandomnode.dyndns.org:8080/~gmaxwell/flagged_protection.png (Though the lack of clarity in the ultimate naming has made it very difficult to finalize it. If anyone wants it I can share SVG/PDF versions of it).
Re: [Wikitech-l] Updating strings for FlaggedRevs for the Flagged Protection/Pending Revisions/Double Check launch
On Sat, May 22, 2010 at 5:09 PM, Gregory Maxwell gmaxw...@gmail.com wrote: I have created a poster suitable for distribution to journalists http://myrandomnode.dyndns.org:8080/~gmaxwell/flagged_protection.png I have revised the graphic based on input from Andrew Gray and others. http://myrandomnode.dyndns.org:8080/~gmaxwell/flagged_protection3.png
Re: [Wikitech-l] Updating strings for FlaggedRevs for the Flagged Protection/Pending Revisions/Double Check launch
On Sat, May 22, 2010 at 8:17 PM, Rob Lanphier ro...@robla.net wrote: I suppose in this case, there might be a simpler debate about which is a better word: sighted, checked or accepted, since I think we actually have the same goal here (we don't want to convey anything other than someone other than an anonymous user gave this a once-over and thought it was ok to display). Accepted might imply that revisions without that flag are not accepted. This isn't actually the case.
Re: [Wikitech-l] VP8 freed!
This is pretty far off topic, but letting FUD sit around is never a good idea. On Thu, May 20, 2010 at 2:08 AM, Hay (Husky) hus...@gmail.com wrote: http://x264dev.multimedia.cx/?p=377 Apparently the codec itself isn't as good as H264, and patent problems are still likely. It's better than Theora though.

You should have seen what VP3 was like when it was handed over to Xiph.Org. The software was horribly buggy, slow, and the quality was fairly poor (at least compared to the current status). Jason's comparison isn't unfair but you need to understand it for what it is— he's comparing a very raw, hardly out of development, set of tools to his own project— which is the most sophisticated and mature video encoder in existence. x264 contains a multitude of pure encoder-side techniques which can substantially improve quality and which could be equally applied to VP8. For an example of the kinds of pure encoder-side improvements available, take a look at the most recent improvements to Theora: http://people.xiph.org/~xiphmont/demo/theora/demo9.html

Even given that, VP8's performance compared to _baseline profile_ H.264 is good. Jason describes it as relatively close to x264’s Baseline Profile. Baseline profile H.264 is all you can use if you actually want to be compatible with a great many devices, including the iPhone. There are half-research codecs that encode and decode at minutes per frame and simply blow away all of this stuff. VP8 is more computationally complex than Theora, but roughly comparable to H.264 baseline. And it compares pretty favourably with H.264 baseline, even without an encoder that doesn't suck. This is all pretty good news.

On the patent part— simply being similar to something doesn't imply patent infringement; Jason is talking out of his rear on that point. He has no particular expertise with patents, and even fairly little knowledge of the specific H.264 patents, as his project ignores them entirely. Codec patents are, in general, excruciatingly specific — it makes passing the examination much easier and doesn't at all reduce the patent's ability to cover the intended format because the format mandates the exact behaviour. This usually makes them easy to avoid. It's easy to say that VP8 has increased patent exposure compared to Theora simply by virtue of its extreme newness (while Theora is old enough to itself be prior art against most of the H.264 pool), but I'd expect any problems to be in areas _unlike_ H.264, because the similar areas would have received the most intense scrutiny. ... and in any case, Google is putting their billion dollar butt on the line— litigation involving inducement to infringe on top of their own violation could be enormous in the extreme.
Re: [Wikitech-l] Vector skin not working on BlackBerry?
On Thu, May 13, 2010 at 3:16 PM, David Gerard dger...@gmail.com wrote: There's a few comments on the Wikimedia blog saying they can't access en:wp any more using their BlackBerry. Though we tried it here on an 8900 and it works. Any other reports?

Punching in http://en.wikipedia.org/ as I normally would... It starts to render, but with an enormous grey area at the top like a gigantic banner ad. Then the browser crashes, I assume; I've never seen it do that before... it throws up a "there was a problem rendering this page" message, blanks the screen, and goes unresponsive. BlackBerry 8310, software v4.5.0.110 (Platform 2.7.0.90)
Re: [Wikitech-l] Broken videos
On Tue, Mar 16, 2010 at 7:15 AM, Lars Aronsson l...@aronsson.se wrote: So how do I tell what's wrong? I have a laptop that is less than half a year old, a clean Ubuntu Linux 9.10 install and the included Firefox 3.5.8 browser. This should work, but these two videos never play more than two seconds and after a while my CPU fan spins up, firefox runs 100%, and all I can do is a kill -9, which kills any other work I had going in other browser windows and tabs.

You aren't running in a virtual machine, are you? Linux+VM is known as a source of playback problems for firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=526080 Otherwise, it's pretty likely you're hitting https://bugzilla.mozilla.org/show_bug.cgi?id=496147 (or another one of several closely related Linux-audio-specific bugs which are in various degrees of fixed in the latest firefox development). I believe that disabling pulseaudio will work around this collection of issues on Ubuntu.

On Tue, Mar 16, 2010 at 6:52 AM, Tei oscar.vi...@gmail.com wrote: Uh.. buffer overflow errors, complex file format loaders in programming languages like C Or false assumptions about memory management with poor error detection and fatal consequences. Maybe even bad program intercommunication. ... The internet was built on text based protocols to avoid these problems or help debug them.

Ironic that you say that... the variable-length null-terminated string is probably the worst thing to ever happen to computer security. Text does imply a degree of transparency, but it's not a security cure-all. In any case, video and audio are in the same boat as JPEG/PNG, +/- some differences in software maturity. There aren't any known or expected malware vectors for them.
Re: [Wikitech-l] Broken videos
On Tue, Mar 16, 2010 at 12:27 PM, Tei oscar.vi...@gmail.com wrote: In any case, video and audio are in the same boat as Jpeg/png, +/- some differences in software maturity. There aren't any known or expected malware vectors for them. Agreed. But it seems possible to generate streams of video that crash the browser. So.. probably autoplay is evil. (It is already evil because it is NSFW, since it distracts coworkers.)

Pegging the CPU on a fairly uncommon platform with a copy of firefox which is soon to be outdated is probably not an enormous worry. Growing pains. Of course, it's useful to submit bug reports on this stuff where ones don't already exist. If you encounter files that break Firefox, Opera, Chrome, or Safari (+XiphQT), please let me know and I'll make sure that a bug gets reported. I'm also happy to fix cortado (the Java fallback for clients without proper video support) bugs — but Wikimedia is using a copy of cortado so enormously old that it's not unlikely that any problems encountered have already been fixed.

In any case, none of the video on Wikimedia sites is autoplay in the sense that it starts on its own. The video tag itself is set to autoplay, but the tag doesn't get inserted into the page until the user clicks. No video surprises. (Unfortunately this process doesn't give the video tag any chance to pre-buffer the video).
Re: [Wikitech-l] modernizing mediawiki
On Tue, Mar 2, 2010 at 11:30 PM, Chris Lewis yecheondigi...@yahoo.com wrote: I hope I am emailing this to the right group. My concern was about mediawiki and its limitations, as well as its outdated methods. As someone who runs a wiki, I've gone through a lot of frustrations. If Wordpress is like Windows 7, then Mediawiki is Windows 2000. Very outdated GUI,

There are many, many, many skins available.

outdated ways of doing things, for example using ftp to edit the settings of the wiki instead of having a

FTP ??!? No. It's just a file. Configuration files are considered pretty reasonable and reliable by a lot of people. ::shrugs:: In any case… It's Free Software, submit patches. Cheers.
Re: [Wikitech-l] [MediaWiki-CVS] SVN: [62223] trunk/extensions/OggHandler
On Wed, Feb 10, 2010 at 8:38 AM, Tim Starling tstarl...@wikimedia.org wrote: That sounds like it needs a one-line fix in OggHandler::normaliseParams(), not 50 lines of code and a new decoder. Do you have a test file or a bug report or something?

Just switching the thumbnailer should be sufficient; I agree the pile of code and the retries were fairly lame (and I think I complained about it on IRC). I'm not sure why any support for thumbnailing ogv's with ffmpeg was retained. I don't see how you can fix it in the normaliseParams() call unless you've scanned the stream and know where the keyframes are. Ffmpeg could be fixed, of course, but the ogg demuxer basically needs a rewrite... I think the patch you did to ffmpeg a while back was a lot better than the code they ultimately included.

Here is a file that won't thumbnail under the current code: http://myrandomnode.dyndns.org:8080/~gmaxwell/theora/only_one_keyframe.ogv

ffmpeg -y -ss 5 -i only_one_keyframe.ogv -f mjpeg -an -vframes 1 foo.jpeg

throws a pile of errors, then foo.jpeg is a zero-byte file.
Re: [Wikitech-l] [MediaWiki-CVS] SVN: [62223] trunk/extensions/OggHandler
On Wed, Feb 10, 2010 at 9:36 PM, Tim Starling tstarl...@wikimedia.org wrote: Gregory Maxwell wrote: Looks like this change removed both the Oggthumb support as well as the code that handles the cases where ffmpeg fails. The usual problem with deploying new solutions for equivalent tasks is that you substitute known issues with unknown ones. I've looked at oggThumb now, I downloaded the latest tarball. Here are some of the ways in which it sucks:

So a couple months back mdale suggested using oggThumb, as the old installed ffmpeg was making spaghetti of some files (old ffmpeg not completely implementing the Theora spec) and the new one that (whoever) tried installing made spaghetti in a different way (failing to thumb because ffmpeg didn't take your seeking patch eons ago). I'd never heard of it, went to look, and recoiled in horror. Then I sent a patch.

* Unlike the current version of FFmpeg, it does not implement bisection seeking. It scans the entire input file to find the relevant frames. For an 85MB test file, it was 30 times slower than FFmpeg.

Of the issues I raised, seeking was the only one I didn't fix. Unfortunately oggvideotools reimplements libogg in C++ so it could use C++ memory management; my patience ran out before I got around to implementing it. If you search the archive you can see how strongly opposed I am to tools that linear scan unnecessarily. But 30x slower on a file that small sounds a bit odd.

* The output filename cannot be specified on the command line, it is generated from the input filename. OggHandler uses a -n option for destination path which just gives an error for me. I don't know if it's a patch or an alpha version feature, but it's not documented either way.

It's in SVN. After the author of the package applied my patches (on the same day I sent them) Mdale asked if he should delay Wikimedia deployment until the fixes I sent in went in; the author offered to simply do a new release. No one took him up on the offer.

* It unconditionally writes a progress message to stdout on every frame in the input file.
* It unconditionally pollutes stderr with verbose stream metadata information.
* It ignores error return values from libtheora functions like th_decode_packetin(), meaning that essentially the *only* thing on stdout/stderr is useless noise.

I'm also not especially keen on its rather non-unixy style. Then again, I think C++ is pretty much crap too, so you can see what my opinion is worth.

What I can say is that, speaking from personal experience, the author of this package is friendly, pleasant to work with, and responsive. Though 'submit patches' takes me out of the 'one-line fix' I advertised — sorry, I'd assumed that Mdale had already worked out the operational angles and my only concerns were correct output and not allowing it to be an enormous DOS vector. Cheers.
Re: [Wikitech-l] [MediaWiki-CVS] SVN: [62223] trunk/extensions/OggHandler
On Wed, Feb 10, 2010 at 12:51 AM, tstarl...@svn.wikimedia.org wrote: http://www.mediawiki.org/wiki/Special:Code/MediaWiki/62223 Revision: 62223 Author: tstarling Date: 2010-02-10 05:51:56 + (Wed, 10 Feb 2010) Log Message: ---
* In preparation for deployment, revert the bulk of Michael's unreviewed work. Time for review has run out. The code has many obvious problems with it. Comparing against r38714 will give you an idea of which changes I am accepting. Fixes bug 22388.
* Removed magic word hook, doesn't do anything useful.
* OggPlayer.js still needs some work.

Looks like this change removed both the Oggthumb support as well as the code that handles the cases where ffmpeg fails. Ffmpeg will fail to generate a thumb if there is no keyframe in the file after the point in time that you requested a thumb. This was causing a failure to generate thumbs for many files because they are short and only have a single keyframe at the beginning.
Re: [Wikitech-l] Theora video in IE? Use Silverlight!
On Fri, Feb 5, 2010 at 3:47 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote: On Fri, Feb 5, 2010 at 3:39 PM, David Gerard dger...@gmail.com wrote: This is clever-ish: http://www.atoker.com/blog/2010/02/04/html5-theora-video-codec-for-silverlight/ He says there that this will Just Work on ~40% of Windows boxes. Not bad. Cortado works wherever Java is installed, which is probably quite a lot more machines -- including Safari on Mac, for instance. If we used anything non-Java, it would surely be Flash, which has much greater penetration than Silverlight on all platforms.

Yes, Cortado works in more places, but there is no reason that BOTH can't be used, extending support to places with Silverlight but without Java. Additionally, although cortado will work on the Java ~1.1 VM that came with Navigator 4... it's rather slow except in the latest JVMs. I expect that a lot of systems with Silverlight are not running an especially modern JVM.

Flash isn't in the running because you still need to be using encumbered media formats to use it... unless you're only playing audio: there are several independent Vorbis implementations for the flash virtual machine, no video codecs yet, and sadly the flash architecture is nowhere near as nice as the Silverlight one for remote-loaded codecs, so you have to completely reinvent all the media infrastructure.
Re: [Wikitech-l] Theora video in IE? Use Silverlight!
On Fri, Feb 5, 2010 at 4:58 PM, David Gerard dger...@gmail.com wrote: On 5 February 2010 21:53, Gregory Maxwell gmaxw...@gmail.com wrote: Yes, Cortado works in more places but there is no reason that BOTH can't be used, extending support to places with silverlight but without Java. The thirty-second startup time of Java for Cortado makes it unusable, in my experience. Here's to Firefox 3.5.

Geesh. What JVM is this? I just stop-watched it here on http://myrandomnode.dyndns.org:8080/~gmaxwell/cortest/cortest1.html and timed a bit over 3 seconds... fresh browser reload, no prior Java applets run, random 1.6GHz x86_64 laptop, and whatever JVM Fedora 12 shipped with. But yes, I'd hope and expect the Silverlight stuff to load faster.

Indeed. What's the performance of the Flash ActiveScript Theora decoder like? Horrible, or just bad?

I'm guessing you meant Vorbis, as there is no Theora port. I've not benchmarked it, but it's supposedly a significant multiple of realtime; I think significant is something like 10x, which doesn't bode well for a video codec implementation. The testing I did with the C-flash compiler on another audio codec convinced me that it could be made to work... though the performance may not ultimately be satisfactory. (e.g. it may only work acceptably on fast computers, at low resolutions, etc). Although the flash VM might be a lot faster by the time it's done. I think it is somewhat moot to speculate on it when it doesn't exist and, as far as I know, no one is actively working on it.

In other news, there is some progress being made on an installable native-code video tag for IE. (http://cristianadam.blogspot.com/2010/01/ie-tag.html; there should be some more news on this in a few days)

On Fri, Feb 5, 2010 at 5:03 PM, Gerard Meijssen gerard.meijs...@gmail.com wrote: Hoi, Providing support for Silverlight means that it needs to be tested to ensure that the support remains stable. Silverlight does not really add value as far as I understand it. It competes with more open standards so reasons can be easily found not to support it. We have to invest in supporting Silverlight, the question is, how does it help us, our readers. We have a reputation that we support open standards ... so how open is Silverlight?

David's post isn't about supporting Silverlight; it's about (ab)using it to shim in support for open formats for IE users. The current video infrastructure supports a half dozen different modes of playback; maintaining one more would be work, but I think it would have a decent value, especially compared to some of the ones already there (VLC plugin? oy).

As far as openness goes, see http://en.wikipedia.org/wiki/Novell_Moonlight But I think it's quite reasonable to have different expectations for a technology used as an openness shim. For example, using Flash normally has the effect of promoting a proprietary web, but if you use Flash only as a canvas replacement for IE users it has a neutral or the opposite long-term effect. To the best of my ability to tell, Silverlight is in a much stronger openness position than Flash is, for whatever that's worth. Microsoft has been rather giving and inclusive in this particular bid for world domination. ;)
Re: [Wikitech-l] Flattening a wikimedia category
On Thu, Feb 4, 2010 at 6:40 PM, Tim Landscheidt t...@tim-landscheidt.de wrote: Is there any reason not to have a flattened structure somewhere on the toolserver (or, in the long run, in MediaWiki)? A quick look at recentchanges for dewp shows about 22000 changes per month, about one every two minutes. With about 8 categories in all, it should be feasible to update the structure incrementally, with daily/weekly/monthly clean new full dumps (or even dispense with up-to-the-second data and just dump the flat structure hourly).

Incremental updates for a 'flattened copy' aren't especially realistic... as one user operation can produce millions of operations on the server. I won't bother saying much more, Daniel Schwen pretty much speaks for my view.
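A toy sketch of why a single edit can fan out like that (the category graph below is invented; the point is just that the flattened view is a transitive closure, so re-parenting one subcategory invalidates every derived row beneath it):

    # Flattened category membership is a transitive closure: one re-parenting of a
    # subcategory changes the flattened rows of everything underneath it.
    from collections import defaultdict

    subcats = defaultdict(set)   # category -> direct subcategories
    pages = defaultdict(set)     # category -> directly contained pages

    subcats["Science"] = {"Physics"}
    subcats["Physics"] = {"Quantum_mechanics"}
    pages["Quantum_mechanics"] = {f"Article_{i}" for i in range(1000)}

    def flattened(cat, seen=None):
        """All pages reachable from cat through any chain of subcategories."""
        seen = set() if seen is None else seen
        if cat in seen:
            return set()
        seen.add(cat)
        result = set(pages[cat])
        for sub in subcats[cat]:
            result |= flattened(sub, seen)
        return result

    print(len(flattened("Science")))                 # 1000 flattened rows
    subcats["Physics"].discard("Quantum_mechanics")  # one small user edit...
    print(len(flattened("Science")))                 # ...and all 1000 rows are stale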
Re: [Wikitech-l] Facebook introducing PHP compiler?
On Tue, Feb 2, 2010 at 1:22 PM, Tei oscar.vi...@gmail.com wrote: I was thinking about that the other day, I understand why MediaWiki don't follow that route.

MediaWiki often runs in environments where users have no shell access, no ability to install extensions, etc. There is some C++ stuff for MediaWiki, such as wikidiff3, but it's optional.
Re: [Wikitech-l] Google phases out support for IE6
On Mon, Feb 1, 2010 at 1:28 PM, David Gerard dger...@gmail.com wrote: On 1 February 2010 15:43, Aryeh Gregor simetrical+wikil...@gmail.com wrote: On Mon, Feb 1, 2010 at 10:14 AM, Thomas Dalton thomas.dal...@gmail.com wrote: It's not just the clutter, though, it's the effort of maintaining it. I don't suggest we maintain it. Just leave it alone. If other changes happen to cause IE5 to break, then remove it, but don't remove *existing* IE5 support as long as IE5 still happens to work with no extra effort on our part. Yes. If someone actually notices something bitrotting and they tell us, that's excellent. If they don't, there you go. That said, there must be *someone* on this list bloody-minded enough to test Wikipedia in every possible browser and file bugs and patches accordingly ...

It shouldn't be a question of bloody-mindedness. The rotting of support for a single browser version would potentially shut out many tens of thousands of users. It's something worth dedicating some resources to. Simply verifying functionality with all the *popular* browsers and platforms is already burdensome. Doing it well (and consistently) requires some infrastructure, such as a collection of virtualized client machines. Once that kind of infrastructure is in place and well oiled, the marginal cost of adding a few more test cases should not be especially great.

The core of Wikipedia functionality is plain text with a smattering of images in common formats. I can think of no reason that this basic reading functionality for IE 5.x and the like should go away for the foreseeable future, but if nothing else, knowing when it doesn't work would be a good thing.
Re: [Wikitech-l] Google phases out support for IE6
On Mon, Feb 1, 2010 at 6:31 PM, Schneelocke schneelo...@gmail.com wrote: Maybe we should do the same - introduce bugs that will cause subtle breakages on browsers we'd rather not go out of our way to specifically support any longer, and see if anyone'll actually complain. :)

People are really bad at complaining, especially web users. We've had prolonged obvious glitches which must have affected hundreds of thousands of people, and maybe we get a couple of reports. Users appear to just hit the back button and move on; either they don't care at all or they do care but assume it will be fixed without their intervention. What you propose is not a good policy, at least not in this application space.
Re: [Wikitech-l] Google phases out support for IE6
On Sun, Jan 31, 2010 at 6:34 PM, John Vandenberg jay...@gmail.com wrote: Even then, there is http://www.askvg.com/download-mozilla-firefox-30-portable-edition-no-installation-needed/ Excuse me? please read the earlier posts in this thread. I am talking about IE for Mac Classic. iCab support? Is Classilla a sensible replacement for people still using IE for Mac? etc.

I couldn't get Classilla running on a blue and white G3 running 9.0.2 when I tried it a couple months ago. I have a couple of these systems for driving some embedded hardware that never got moved to anything more modern; they'd be perfectly adequate systems for web browsing if you could get a workably up-to-date web browser on them: the IE the OS ships with hard-locks the machine on apple.com of all places! I was only bothering to attempt this because I wanted to get a screenshot of cortado playing videos on something very old, and I only spent an hour or so on it. (Wikipedia, OTOH, worked fine with the IE that comes with the OS on those systems)

But seriously. Outright *excluding* these old things shouldn't even be a consideration. Even a very small audience (like 0.02%) is tens of thousands of readers. Mediawiki (and the WMF deployment) already has many features which don't work / don't work well on fairly old systems, so that bridge has already been crossed, but outright dropping support for basic use?
Re: [Wikitech-l] Log of failed searches
On Thu, Jan 14, 2010 at 10:47 AM, Magnus Manske magnusman...@googlemail.com wrote: Suggestion : * log search and SHA1 IP hash (anonymous!)

*Any* mapping of the IP is not anonymous. Please see the AOL search results where unique IDs were connected between searches to disclose information. (Moreover, a straight simple hash of an IP can be reversed simply by making a table of all expected IPs.) However: since this is just for internal logging there is no need to hash the IP. Just log it directly, and thus avoid the risk that someone later will think the hash is something which can be disclosed.

* search queries are logged in a standardized fashion (for grouping), e.g. lowercase, single spaces, no leading/trailing spaces, special chars converted to spaces, etc.

Excellent.

* display searches per week (?) that have been searched for at least 10 times from at least 5 different IP hashes (to avoid people searching their own name 100 times...)

What I've suggested elsewhere was at least 4 different IPs; 5 sounds fine to me too. I don't know that the minimum of 10 queries matters once the 5-IP check is in place. Per week would be okay. No shorter though.

If someone gives me a log format, I'll gladly write a fast tool for producing this output. (I did something like that before where I gave Brion a tool to produce stats from access logs) I think I have C code for a parser for Wikimedia's squid logs... so if it's just that, I already have a good chunk of it done.
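The table-of-all-IPs point is easy to demonstrate; here is a hypothetical sketch (a /24 stands in for enumerating the full IPv4 space, which is only a few billion hashes and entirely practical):

    # Why hashing IPs does not anonymize them: the address space is small enough
    # to enumerate, so a precomputed reverse table recovers the original address.
    import hashlib
    import ipaddress

    def build_reverse_table(prefix):
        return {hashlib.sha1(str(ip).encode()).hexdigest(): str(ip)
                for ip in ipaddress.ip_network(prefix)}

    table = build_reverse_table("198.51.100.0/24")
    leaked_hash = hashlib.sha1(b"198.51.100.42").hexdigest()
    print(table[leaked_hash])   # -> 198.51.100.42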
Re: [Wikitech-l] Log of failed searches
On Thu, Jan 14, 2010 at 11:01 AM, David Gerard dger...@gmail.com wrote: 2010/1/14 Bryan Tong Minh bryan.tongm...@gmail.com: On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske magnusman...@googlemail.com wrote: * log search and SHA1 IP hash (anonymous!) There are only 2 billion unique addresses and they can all be found in half an hour probably. A count of search terms, with no IP info at all? Would be more useful than nothing. (modulo the issue Michael Snow raised re: searches on suppressable names)

Magnus was not suggesting disclosing the IP hash, as far as I can tell. He was demonstrating an abundance of caution in suggesting only logging that. (er, well, yea, if he was suggesting disclosing that... we shouldn't do that. Even if we add a secret to the hash, it's risky and allows interesting correlation attacks)

Here is what I would suggest disclosing:

#start_datetime end_datetime hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics

Which has first been filtered by:

* Canonicalization of strings (at least ASCII case folding)
* Excluding strings over some length
* Excluding searches which did not come from at least 5 distinct IPs during the reporting interval

There will be useful information excluded by this process, e.g. gads of misspellings which came from only two to four unique IPs... but the output would still be *far* more useful than no information at all.
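A sketch of the filtering pass being described, using the thresholds from the message above (the record layout and field names here are hypothetical, not the actual squid log format; the per-project field anticipates the refinement in the next message):

    # Aggregate search queries and only disclose strings seen from at least
    # 5 distinct IPs in the reporting window. Illustrative sketch only.
    import re
    from collections import defaultdict

    MIN_DISTINCT_IPS = 5
    MAX_QUERY_LEN = 100          # "excluding strings over some length"

    def canonicalize(query):
        # At least ASCII case folding, plus whitespace trimming/collapsing.
        return re.sub(r"\s+", " ", query.strip().lower())

    def build_report(records):
        """records: iterable of (ip, project, raw_query) tuples."""
        ips_seen = defaultdict(set)   # (project, query) -> distinct source IPs
        hits = defaultdict(int)       # (project, query) -> total hit count
        for ip, project, raw_query in records:
            query = canonicalize(raw_query)
            if not query or len(query) > MAX_QUERY_LEN:
                continue
            ips_seen[(project, query)].add(ip)
            hits[(project, query)] += 1
        # Disclose only the rows that clear the distinct-IP threshold.
        return sorted(((project, count, query)
                       for (project, query), count in hits.items()
                       if len(ips_seen[(project, query)]) >= MIN_DISTINCT_IPS),
                      key=lambda row: -row[1])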
Re: [Wikitech-l] Log of failed searches
On Thu, Jan 14, 2010 at 11:15 AM, Gregory Maxwell gmaxw...@gmail.com wrote: Here is what I would suggest disclosing:

#start_datetime end_datetime hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics

The logs are probably combined across wikis, so I'd change that to

#start_datetime end_datetime projectcode hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 5 autoerotic quantum chromodynamics
2010-01-01-0:0:4 2010-01-13-23-59-50 de.wikipedia 25093 Bondage Disziplin Pokémon
...
...
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikinews 5 ethics in journalism
Re: [Wikitech-l] Log of failed searches
On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin conrad.ir...@googlemail.com wrote: Wiktionary is case-sensitive and so case-folding there may not be appropriate; I personally would be interested in seeing these logs before even the NFC normalizers get to them (given a lack of any other source to find out how people type fun characters in the wild) though I can appreciate this is somewhat sadistic, and probably the logs are taken too late for this. It would not be too much work to publish a set of post-processing scripts that could perform those normalisations that people are interested in; I don't think any two people will agree exactly on what

You've missed the point of the normalization here. It's not to be helpful to users: as you observe, it's easy for the recipient of the list to perform their own. The reason to normalize is to push more queries above the reporting threshold. For example, 5 people might search for john f. kinndey (a misspelling of John F. Kennedy?) but all capitalize it differently. A redirect on this misspelling would be useful regardless of the case.

All things equal I'd rather *not* normalize the data... it's just more stuff that may have surprising behaviour. But I think this is something which may need to be balanced against the disclosure threshold. It would also be possible to do the disclosure calculation against normalized data while releasing the raw values... but I must admit a little bit of uneasiness that the normalization might be ignoring some piece of information relevant to privacy. For example, if we were to go that route we might employ some fairly aggressive normalization... removing all whitespace and punctuation. If we went as far as also removing all *numbers* from the check we'd run into things like Greg Maxwell (555)-555-1212 getting published because enough distinct people searched for greg maxwell. Obviously the answer to that one is don't remove numbers from the check, but I worry about the cases I haven't thought of.

On Thu, Jan 14, 2010 at 12:51 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote: Some people might search for their own name more than five times in a week, possibly together with other embarrassing or incriminating search terms.

Yes, it's possible that someone may search 5 times, from 5 IPs (which *might* be from one machine due to proxy round-robins), an identical string ... MyFullName seen on friday night with a woman other than his wife ... but what to do? Any information which is disclosed has some risk of disclosing something that someone would rather not be. This risk can be made arbitrarily small, but it can't be eliminated. I think the benefit to the readers of having this information available easily outweighs some sufficiently fringe confidentiality concern. At some point your frequently repeated search is a statistic, which no reasonable privacy policy would frown on disclosing.

This is important to our operations, disclosing it is in the public interest, and failing to do work in this area puts us at a disadvantage compared to other parties who might be far less scrupulous. (e.g. If WMF's search performs poorly, you might feel compelled to use Search Engine X — which happens to secretly sell your data to the highest bidder.) Is there some sufficiently high number which *no one* paying attention here has a concern about? We could simply start with that and possibly lower the threshold over time as the lowest hanging fruit are solved, tracking our disclosure comfort.
I think we all have an interest in and an obligation to take every reasonable precaution, but no one can ask for more than that. Would anyone feel more comfortable if this ignored queries made via the secure server? Non-HTTPS traffic can be watched by anyone on the path between you and Wikimedia... any illusion of absolute privacy on the insecure traffic is patently false already. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Log of failed searches
On Thu, Jan 14, 2010 at 6:32 PM, Platonides platoni...@gmail.com wrote: Sampled search logs are unlikely to reveal them though, since what they are repeating are the non-keywords, not the full query. Sampling is fine, but aggregated logs aren't likely to… that's the primary reason for reporting things other than the topmost queries. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] Wikimedia crosses 10Gbit/sec
Today Wikimedia's world-wide five-minute-average transmission rate crossed 10gbit/sec for the first time ever, as far as I know. This peak rate was achieved while serving roughly 91,725 requests per second. This fantastic news is almost coincident with Wikipedia's 9th anniversary on January 15th. [http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Day ] In casual units, a rate of 10gbit/sec is roughly equivalent to 5 US Libraries of Congress per day (using the common 1 LoC = 20 TiB units). Wikimedia's 24 hour average transmission rate is now over 5.4gbit/sec, or 2.6 US LoC/day. A snapshot of the traffic graph on this historic day can be seen here: http://commons.wikimedia.org/wiki/File:2010-01-11_wikimedia_crosses_10gbit.png Ten years ago many traditional information sources were turning electronic, and possibly locking out the unlimited use previously enjoyed by public libraries. It seemed to me that closed pay-per-use electronic databases would soon dominate all other sources of factual information. At the same time, the public seemed to be losing much of its interest in the more intellectually active pursuits, such as reading. So if someone told me then that within the decade one of the most popular websites in the world would be a free content encyclopedia, consisting primarily of text, or that the world would soon be consuming over 50 terabytes of compressed educational material per day—I never would have believed them. The growth and success of the Wikimedia projects is an amazing accomplishment, both for the staff and volunteers keeping the infrastructure operating efficiently as well as the tens of thousands of volunteers contributing this amazing corpus. This success affirms the importance of intellectual endeavours in our daily lives and demonstrates the awesome power of people working together towards a common goal. Congratulations to you all. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] IE8 Compatibility View
On Mon, Jan 11, 2010 at 7:18 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote: On Mon, Jan 11, 2010 at 6:36 PM, Mike.lifeguard mike.lifegu...@gmail.com wrote: Microsoft has informed us with an email to OTRS (#201000039819) that wikimedia.org (and presumably our other domains) will be removed from
Why would you presume that?
the Compatibility View List for Internet Explorer 8 near the end of January 2010.
I don't know why we were ever on it. We always marked our IE7 fixes with "if IE 7" and not "if IE gt 7", right? IE8 should have been getting good CSS2.1 from the get-go. http://www.microsoft.com/downloads/details.aspx?familyid=B885E621-91B7-432D-8175-A745B87D2588&displaylang=en There is an XLS file here indicating that wikimedia.org is pending removal, but the other domains are not. (The email appears to be wikimedia.org specific). Someone should probably take all the WMF domains on that list and request that they all be removed. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] search ranking
On Sun, Jan 10, 2010 at 5:50 PM, Robert Stojnic rainma...@gmail.com wrote: So we got some new search servers (thanks Mark & Rob) and I have deployed them today. As a consequence, the search limit is now re-raised to 500 and interwiki search is back on all wikis. I would still however like to keep srmax on 50 for API because there seems to be quite a number of broken bots and people experimenting... Additionally, I've switched mwsuggest to lucene backend, so now the AJAX suggestions are no longer alphabetical but ranked according to number of links to them (and some CamelCase and such redirects are not shown). This has been active on en.wp for a while, but now it's on all wikis. If you see things broken please find me on IRC, or leave a message on my en.wp talk page. If anyone feels adventurous: http://www.joachims.org/publications/joachims_02c.pdf http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] search ranking
On Sun, Jan 10, 2010 at 9:52 PM, William Pietri will...@scissor.com wrote: On 01/10/2010 06:12 PM, Gregory Maxwell wrote: If anyone feels adventurous: http://www.joachims.org/publications/joachims_02c.pdf http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html Ooh, that looks fun. If I wanted to investigate, I'd start here, yes? http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/ Is the click data available, too?
It's not— but progress on this subject would probably be a good justification for making some available. Without the click data available, I'd suggest simply using the stats.grok.se page view data: It won't allow the system to learn how preferences change as a function of query text, but it would let you try out all the machinery. I'd expect that static page popularity would be the obvious fill-in data you'd use where click through information is not available, in any case. So for example, if query X returns A,B,C,D,E and you only know the user clicked B then you can assume that B > [A,C,D,E], but by mixing in the static popularity you could also decide that B > D > E > A > C (because D,E,A,C is the popularity order of the remaining pages). In order to use this kind of predictive modelling you need to create some feature extraction. Basically, you take your input and convert it into a feature-vector: a multidimensional value which represents the input as a finite set of floating point numbers which (hopefully) exposes relevant information and ignores irrelevant information. I've never used rank-svm before, but for text classification with SVM it is pretty common to use the presence of words to construct a sparse vector. E.g. after stripping out markup every input word (or word pair, or word fragment or...) gets assigned a dimension. The vector for a text has the value 1.0 in that dimension if the text contains the word, 0 if it doesn't. So, "the blue cat" might be [14:1.0 258:1.0 982:1.0], presuming that "the" was assigned dimension 14, "blue" 258, "cat" 982. The zillion other possible dimensions are zero. Typical linear SVM classifiers work reasonably well on highly sparse data like this, even if there are hundreds of thousands of dimensions. Full text indexers like lucene also do basically the same kind of thing internally, usually after some folding/stemming (i.e. [girls,gals,dames,female,lady,girl,womens] -> women) and elimination of common words (e.g. "the"), so the lucene tools may already be doing most or all of the work you'd need for basic feature extraction. It looks like for this rank SVM I'd run the feature-extraction on both the query and the article and combine them into one vector for the SVM. For example, you could do something like assign a different value for the word dimension (i.e. 2 if a word is in both vectors, -1 if it's in the query but not the article, 0.5 if it's only in the article... etc), or give query-words different dimension values than article words (i.e. if you're tracking 100,000 words, add 100,000 to the query word dimension numbers). I have no clue which of the infinite possible ways would work best, there may be some suggestions in the literature but there is no replacement for simply trying a lot of approaches. 95% of the magic in making machine learning work well is coming up with good feature extraction. For Wikipedia data in addition to the word-existence metric which is often used for free-text the presence of categories (i.e. 
each cat is mapped to a dimension number), and link structure information (perhaps different values for words which are linked?, only using wikilinked words as the article keys) are obvious things which could be added. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
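To make the feature-extraction hand-waving a little more concrete, here is roughly what a combined query+article vector could look like (a toy Python sketch; the dimension hashing and the 2.0/-1.0/0.5 weights are only the illustrative choices mentioned above, not recommendations):

import re
import zlib

VOCAB_DIMS = 100000  # arbitrary number of word dimensions for this sketch

def words(text):
    # crude tokenizer: lowercase runs of word characters
    return set(re.findall(r"\w+", text.lower()))

def dim(word):
    # deterministically map a word to a dimension number (collisions are
    # possible but harmless for a toy example)
    return zlib.crc32(word.encode("utf-8")) % VOCAB_DIMS

def feature_vector(query, article):
    # sparse {dimension: value} vector; the value encodes where the word
    # occurs: 2.0 = both query and article, -1.0 = query only,
    # 0.5 = article only
    qw, aw = words(query), words(article)
    vec = {}
    for w in qw | aw:
        if w in qw and w in aw:
            vec[dim(w)] = 2.0
        elif w in qw:
            vec[dim(w)] = -1.0
        else:
            vec[dim(w)] = 0.5
    return vec

# feature_vector("blue cat", "the blue cat sat on the mat")
# -> something like {dim('blue'): 2.0, dim('cat'): 2.0, dim('the'): 0.5, ...}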
Re: [Wikitech-l] downloading wikipedia database dumps
On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor simetrical+wikil...@gmail.com wrote: On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote: I am not sure about the cost of the bandwidth, but the wikipedia image dumps are no longer available on the wikipedia dump anyway. I am guessing they were removed partly because of the bandwidth cost, or else image licensing issues perhaps. I think we just don't have infrastructure set up to dump images. I'm very sure bandwidth is not an issue -- the number of people with a
Correct. The space wasn't available for the required intermediate cop(y|ies).
terabyte (or is it more?) handy that they want to download a Wikipedia image dump to will be vanishingly small compared to normal users.
s/terabyte/several terabytes/ My copy is not up to date, but it's not smaller than 4.
Licensing wouldn't be an issue for Commons, at least, as long as it's easy to link the images up to their license pages. (I imagine it would technically violate some licenses, but no one would probably worry about it.)
We also dump the licensing information. If we can lawfully put the images on the website then we can also distribute them in dump form. There is and can be no licensing problem.
Wikipedia uses an average of multiple gigabits per second of bandwidth, as I recall.
http://www.nedworks.org/~mark/reqstats/trafficstats-daily.png Though only this part is paid for: http://www.nedworks.org/~mark/reqstats/transitstats-daily.png The rest is peering, etc. which is only paid for in the form of equipment, port fees, and operational costs.
The sensible bandwidth-saving way to do it would be to set up an rsync daemon on the image servers, and let people use that.
This was how I maintained a running mirror for a considerable time. Unfortunately the process broke when WMF ran out of space and needed to switch servers.
On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote: Bittorrent is simply a more efficient method to distribute files,
No. In a very real absolute sense bittorrent is considerably less efficient than other means. Bittorrent moves more of the outbound traffic to the edges of the network where the real cost per gbit/sec is much greater than at major datacenters, because a megabit on a low speed link is more costly than a megabit on a high speed link and a megabit on 1 mile of fiber is more expensive than a megabit on 10 feet of fiber. Moreover, bittorrent is topology unaware so the path length tends to approach the internet's mean path length. Datacenters tend to be more centrally located topology wise, and topology aware distribution is easily applied to centralized stores. (E.g. WMF satisfies requests from Europe in Europe, though not for the dump downloads as there simply isn't enough traffic to justify it.) Bittorrent also is a more complicated, higher overhead service which requires more memory and more disk IO than traditional transfer mechanisms. There are certainly cases where bittorrent is valuable, such as the flash mob case of a new OS release. This really isn't one of those cases.
On Thu, Jan 7, 2010 at 11:52 AM, William Pietri will...@scissor.com wrote: On 01/07/2010 01:40 AM, Jamie Morken wrote: I have a suggestion for wikipedia!! I think that the database dumps including the image files should be made available by a wikipedia bittorrent tracker so that people would be able to download the wikipedia backups including the images (which currently they can't do) and also so that wikipedia's bandwidth costs would be reduced. [...] 
Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem. Bittorrent's real strength is when a lot of people want to download the same thing at once. E.g., when a new Ubuntu release comes out. Since Bittorrent requires all downloaders to be uploaders, it turns the flood of users into a benefit. But unless somebody has stats otherwise, I'd guess that isn't the problem here.
We tried BT for the Commons POTY archive once while I was watching and we never had a downloader stay connected long enough to help another downloader... and that was only 500mb, much easier to seed. BT also makes the server costs a lot higher: it has more cpu/memory overhead, and creates a lot of random disk IO. For low volume large files it's often not much of a win. I haven't seen the numbers for a long time, but when I last looked download.wikimedia.org was producing fairly little traffic... and much of what it was producing was outside of the peak busy hour for the sites. Since the transit is paid for on the 95th percentile and the WMF still has a decent day/night swing, out-of-peak traffic is effectively free. The bandwidth is nothing to worry about. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org
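For anyone who hasn't dealt with transit billing, the 95th percentile calculation amounts to something like this (sketch only; the sample lists and the price argument are just illustrative placeholders):

def transit_bill(mbps_in, mbps_out, price_per_mbps=10.0):
    # mbps_in/mbps_out: the month's five-minute average samples.
    # Billing takes the greater of in/out for each sample, sorts them, and
    # charges for the 95th percentile sample, so the top 5% of samples
    # (and any traffic hiding under the existing peak) is effectively free.
    samples = sorted(max(i, o) for i, o in zip(mbps_in, mbps_out))
    p95 = samples[min(int(len(samples) * 0.95), len(samples) - 1)]
    return p95 * price_per_mbps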
Re: [Wikitech-l] Redirect disclosure on hover
On Sun, Dec 6, 2009 at 6:56 PM, John Doe phoenixoverr...@gmail.com wrote: or a simpler method would be to use a javascript tool like I use which was created by lupin called popups which can actually get the redirect target page show the first picture and first paragraph on mouse hover
You have a weird definition of simpler. :) Thousands of lines of JS code and an additional HTTP request per link isn't simple in my book. :) Though the popups tool does provide a number of other advantages which justify its load and complexity... are you aware of any large MediaWiki installs which have this tool activated by default (i.e. for anons?)
On Sun, Dec 6, 2009 at 6:45 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote: Caching is one problem here. Another is that you need to reliably generate the redirected from link somehow, so that redirects are maintainable. You don't want an editor to click a link, arrive at a totally different page (maybe via an inappropriate redirect), and have no idea how they got there.
Hm. This could be resolved by mixing in the URL stuffing alternative, linking to /target#from_redirectname then letting client side JS code generate the redirect back-link.
The job queue is already horribly overloaded, I don't think adding more things to it would be a good thing.
Right… though it's not especially harmful if this information is stale so there is the possibility of simply letting it be stale. ...but workqueue workload is why I waved my arms about request merging and priority queueing. I'd expect that the actual additional work in fixing up redirect destination changes would be pretty negligible if the entries were handled at a lower priority and eliminated whenever their task was completed as a side effect of some other change.
On the other hand, since this particular change doesn't affect anything visible to templates or such, you wouldn't have to reparse the whole page to update it, in principle. [snip]
Good point. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Usability initiative (HotCatreplacement/improvements etc.)
On Wed, Sep 16, 2009 at 5:24 PM, Jared Williams jared.willia...@ntlworld.com wrote: Can distribute them across multiple domain names, thereby bypassing the browser/HTTP limits. Something along the lines of 'c'.(crc32($title) & 3).'.en.wikipedia.org' Would at least attempt to download up to 4 times as many things.
Right, but it reduces connection reuse. So you end up taking more TCP handshakes and spend more time with a small transmission window. (plus more DNS round-trips; relevant because wikimedia uses low TTLs for GSLB reasons) TNSTAAFL. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
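In Python terms the suggested sharding is roughly the following (sketch only; the four-bucket mask and the hostname pattern come from the quoted snippet, everything else is illustrative):

import zlib

def shard_host(title, base="en.wikipedia.org", mask=3):
    # hash the page title and spread requests over mask+1 hostnames
    # (c0.en.wikipedia.org ... c3.en.wikipedia.org) so the browser will
    # open separate per-host connection pools
    return "c%d.%s" % (zlib.crc32(title.encode("utf-8")) & mask, base)

# shard_host("Main Page") -> e.g. 'c2.en.wikipedia.org'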
Re: [Wikitech-l] Not allowing certain external link types?
On Sat, Sep 5, 2009 at 4:28 PM, David Gerarddger...@gmail.com wrote: Although his actions were IMO dickish, he has some point: is there any reason to allow .exe links on WMF sites? Is there a clean method to disable them? Is this a bad idea for any reason? What should default settings be in MediaWiki itself? etc., etc. http://markmail.org/message/6zsebtdrahmwzs3s What once was rubbish is no more? :) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] flaggedrevs.labs.wikimedia.org Status?
On Tue, Sep 1, 2009 at 5:03 AM, K. Peacheyp858sn...@yahoo.com.au wrote: On Tue, Sep 1, 2009 at 5:39 PM, Gregory Maxwellgmaxw...@gmail.com wrote: Seems my concern was moot in any case... Every time I loaded it I've only seen trashed pages like this: http://flaggedrevs.labs.wikimedia.org/wiki/Super_Smash_Bros._Melee But I guess this is just a result of the import being incomplete. It's not trashed it's just missing templates and possibly css [snip] Um... Read the thread plz. :) And it's different now than it was last night, last night the templates weren't there yet and it looked like a car hit it. :) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] flaggedrevs.labs.wikimedia.org Status?
On Tue, Sep 1, 2009 at 7:17 PM, K. Peachey p858sn...@yahoo.com.au wrote: On Wed, Sep 2, 2009 at 7:02 AM, Platonides platoni...@gmail.com wrote: You know, when you point to a broken page, people^W wikipedians tend to do absurd things like fixing them :) I was going to fix some up, but import is restricted and i was too lazy to do copy/paste imports. Ehhh. I don't know that it makes sense to spend effort manually fixing pages on a test project. If the import procedure is not working right it should be improved... In any case, I'm sorry for the tangent. The main intent of my post was to determine the current status: Is the import finished? When will the configuration changes for flagged protection be turned on? ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] flaggedrevs.labs.wikimedia.org Status?
Greetings. Can anyone provide a status update regarding flaggedrevs.labs.wikimedia.org? In the future perhaps it would be better to import simple English Wikipedia for enwp testing: The lack of templates makes the site look extensively vandalized already. I'm guessing that an alternative English-language project would be more useful than a subset of enwp. :) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Wikipedia iPhone app official page?
On Sat, Aug 29, 2009 at 9:07 AM, Dmitriy Sintsov ques...@rambler.ru wrote: Some local coder told me that GIT is slower and consumes much more RAM on some operations than SVN. I can't confirm that, though, because I never used GIT and still rarely use SVN. But, be warned. I laughed at this... GIT has a number of negatives, but poor speed is not one of them, especially if you're used to working with SVN and a remote server. Maybe this is just a Windows issue? GIT leaves a lot of work to the filesystem. My primary complaint with GIT is that if you're doing non-trivial tree manipulation it's not at all difficult to convert your tree into Swiss cheese, and it can be fairly difficult to fix it other than by pulling a copy from an un-screwed-up replica and cherry-picking your later changes back into it. OTOH, the sorts of tree uber-bonsai likely to result in a shredded tree are pretty much not possible in SVN. YMMV. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Wikipedia iPhone app official page?
On Sat, Aug 29, 2009 at 6:48 PM, Marco Schuster ma...@harddisk.is-a-geek.org wrote: And so to the disk. If the disk or the controller sucks or is simply old (not everyone has shiny new hardware), you're also damn slow. What should also not be underestimated is the diskspace demand of a GIT repo - not
On most projects I'm working on, even ones with long histories, the git repo is around the same size as a checkout and on many it's smaller. Of course, you'll also need a checkout in order to do useful work with it, but doubling the storage isn't usually a big deal. If you're the sort of person who does development using a whole lot of separate local trees git can use the same storage to provide history for all of them, even when the trees are partially divergent. DVCS is especially useful on a laptop because you can perform useful version control while disconnected from the internet. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] please make wikimedia.org mailing lists searchable
On Mon, Aug 24, 2009 at 1:16 AM, jida...@jidanni.org wrote: Why have each user jump through such hoops, and still leave this door open to the "bad guys" whoever they are. [snip] If you wish to have a productive discussion with people you'll be most successful if you try to understand and empathize with their concerns, so that you can find a solution which satisfies everyone. You won't go far with scare-quoted phrases like "the bad guys" and hyperbole like "held for ransom" and "North Korean style". The current behaviour was established as the result of experience: It's not something that was done speculatively, but as a solution to real problems which were occurring. Removing messages from archives was found to be time-consuming and ineffective because once out, the removal often did nothing. The annoyance of dealing with it was magnified because it had to be done by someone with shell access and because it was, naturally, always urgent. People make mistakes, both the clicked-the-wrong-button type and the failed-to-consider-the-consequence type, and people often play fast and loose with other people's privacy. As an example— an issue we've had in the past is people responding with private details to a message which included a public list buried in its carbon-copy chain. So admonishing "be more careful" really doesn't solve it: The lack of google indexing is intended to address the cases where "be careful" failed. The intent isn't to stop people from searching for information in the lists, which would be an impossible goal, but to prevent material from the lists from showing up at the top of google when people perform random searches for various people's names and to make removals actually effective. So the availability of archive files is not a problem. Perhaps this is more of a problem for the Wikimedia Lists than many others due to the high search placement of the Wiki(p|m)edia sites in general. I think the comparison to LKML is entirely inappropriate: not only can you make an entirely different set of assumptions about the users' technical prowess but LKML is open for posting to non-subscribers … the level of SPAM received through it in the past has exceeded the volume of some of our lists, it's like arguing that we shouldn't wear underwear because the nice folks at the nudist colony don't either. :) Different culture, different issues, different solutions. Other people do have the same problems and concerns— though obviously you're less likely to see them if they aren't indexed by google! Being able to keep your messages out of the search indexes while remaining open to anyone who is willing to click a few buttons is a primary attraction of the yahoo-groups service. Be thankful that we don't force you through an infuriating web interface like they do. I think everyone would like better search than we currently have available. It should be possible to provide a solid search interface without increasing the level of exposure. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] please make wikimedia.org mailing lists searchable
On Sat, Aug 22, 2009 at 11:20 PM, jida...@jidanni.org wrote: All I know is I don't know of any other examples of security through obscurity on mailing lists. Wasn't Jimbo inventing a new search engine? I don't know though... can't search for the announcement. Download the gzipped mbox files from when you were not subscribed, for example http://lists.wikimedia.org/pipermail/foundation-l/2009-July.txt.gz Import this into the client software of your choice. Enjoy your new-found ability to search. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
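If a mail client feels like too much work, a few lines of Python over those same mbox files will do (rough sketch only; the file name and pattern are just examples):

import gzip
import mailbox
import re
import shutil
import tempfile

def grep_archive(gz_path, pattern):
    # e.g. grep_archive("2009-July.txt.gz", "search engine")
    # mailman's .txt archives are plain mbox, so the mailbox module can
    # walk them once they're decompressed somewhere
    rx = re.compile(pattern, re.IGNORECASE)
    with gzip.open(gz_path, "rb") as src, \
         tempfile.NamedTemporaryFile(suffix=".mbox", delete=False) as dst:
        shutil.copyfileobj(src, dst)
    for msg in mailbox.mbox(dst.name):
        body = msg.get_payload()
        if isinstance(body, str) and rx.search(body):
            print(msg["date"], msg["from"], msg["subject"])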
Re: [Wikitech-l] Batik SVG-to-PNG server revisited
On Sun, Aug 16, 2009 at 8:00 PM, Hk knghk@web.de wrote: New test results were added at http://www.mediawiki.org/wiki/SVG_benchmarks This looks even better than my first attempt. Nonetheless, it is clear that batikd is not ready to use but needs to be worked on.
I'm not sure where the notion came up that median performance was a useful criterion for selecting a rendering engine. I'd expect that the criteria would be something like this:
0. security comfort (i.e. ability to deny local file access, strength against overflow exploits)
1. worst case memory usage vs average
2. worst case cpu consumption vs average
3. least surprising rendered output
4. average cpu consumption
Batik probably wins on 0, Inkscape wins on 3 (being bug compatible with something the user can operate at home is arguably superior to being correct), rsvg wins on 1,2,4 (and maybe daemonized batik is getting close on 4). Sometimes the CPU comparisons can be a bit hard... a rendering engine which doesn't support SVG filters (i.e. old rsvg) will likely be faster, but it will be producing unexpected output. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] Video transcoding settings Was: [54611] trunk/extensions/WikiAtHome/WikiAtHome.php
On Fri, Aug 7, 2009 at 5:29 PM, d...@svn.wikimedia.org wrote: http://www.mediawiki.org/wiki/Special:Code/MediaWiki/54611 Revision: 54611 Author: dale Date: 2009-08-07 21:29:26 + (Fri, 07 Aug 2009) Log Message: --- added a explicit keyframeInterval per gmaxwell's mention on wikitech-l. (I get ffmpeg2theora: unrecognized option `--buf-delay for adding in buf-delay) I thought firefogg was tracking j^'s nightly? If the encoder has two-pass it has --buf-delay. Does firefogg perhaps need to be changed to expose it? ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Video Quality for Derivatives (was Re:w...@home Extension)
On Thu, Aug 6, 2009 at 8:00 PM, Michael Dale md...@wikimedia.org wrote: So I committed ~basic~ derivate code support for oggHandler in r54550 (more solid support on the way) Based input from the w...@home thread; here are updated target qualities expressed via the firefogg api to ffmpeg2thoera
Not using two-pass on the rate controlled versions? It's a pretty consistent performance improvement[1], and it eliminates the first frame blurry issue that sometimes comes up for talking heads. (Note, that by default two-pass cranks the keyframe interval to 256 and makes the buf-delay infinite. So you'll need to set those to sane values for streaming). [1] For example: http://people.xiph.org/~maikmerten/plots/bbb-68s/managed/psnr.png ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Video Quality for Derivatives (was Re:w...@home Extension)
On Thu, Aug 6, 2009 at 8:17 PM, Gregory Maxwell gmaxw...@gmail.com wrote: On Thu, Aug 6, 2009 at 8:00 PM, Michael Dale md...@wikimedia.org wrote: So I committed ~basic~ derivate code support for oggHandler in r54550 (more solid support on the way) Based input from the w...@home thread; here are updated target qualities expressed via the firefogg api to ffmpeg2thoera Not using two-pass on the rate controlled versions? It's a pretty consistent performance improvement[1], and it eliminates the first frame blurry issue that sometimes comes up for talking heads. (Note, that by default two-pass cranks the keyframe interval to 256 and makes the buf-delay infinite. So you'll need to set those to sane values for streaming).
I see r54562 switching to two-pass, but as-is this will produce files which are not really streamable (because the streams can and will burst to 10mbits even though the overall rate is 500kbit or whatever is requested). We're going to want to do something like -k 64 --buf-delay=256. I'm not sure what key-frame interval we should be using— Longer intervals lead to clearly better compression, with diminishing returns over 512 or so depending on the content... but lower seeking granularity during long spans without keyframes. The ffmpeg2theora defaults are 64 in one-pass mode, 256 in two-pass mode. Buf-delay indicates the amount of buffering the stream is targeting. I.e. For a 30fps stream at 100kbit/sec a buf-delay of 60 means that the encoder expects that the decoder will have buffered at least 200kbit (25kbyte) of video data before playback starts. If the buffer runs dry the playback stalls— pretty crappy for the user's experience. So bigger buf-delays either mean a longer buffering time before playback or more risk of stalling. In the above (30,60,100) example the client would require 2 seconds to fill the buffer if they were transferring at 100kbit/sec, 1 second if they are transferring at 200kbit/sec. etc. The default is the same as the keyframe interval (64) in one pass mode, and infinite in two-pass mode. Generally you don't want the buf-delay to be less than the keyframe interval, as quality tanks pretty badly at that setting. Sadly the video tag doesn't currently provide any direct way to request a minimum buffering. Firefox just takes a guess and every time it stalls it guesses more. Currently the guesses are pretty bad in my experience, though this is something we'll hopefully get addressed in future versions. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
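The arithmetic above in reusable form, for anyone who wants to try other numbers (same assumptions as the worked example: constant target bitrate, and the decoder buffers buf-delay worth of frames before starting):

def startup_buffering(fps, buf_delay_frames, target_kbps, download_kbps):
    # buf-delay is counted in frames; at the target bitrate those frames
    # correspond to this much data the decoder wants buffered up front...
    seconds_of_video = buf_delay_frames / float(fps)
    kbits_buffered = seconds_of_video * target_kbps
    # ...and filling that buffer at the viewer's download speed takes:
    startup_seconds = kbits_buffered / float(download_kbps)
    return kbits_buffered, startup_seconds

# the (30 fps, buf-delay 60, 100 kbit/sec) example above:
#   startup_buffering(30, 60, 100, 100) -> (200.0, 2.0)
#   startup_buffering(30, 60, 100, 200) -> (200.0, 1.0)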
Re: [Wikitech-l] How to securely connect to Wikipedia in a public wifi ?
On Tue, Aug 4, 2009 at 7:47 PM, Brion Vibberbr...@wikimedia.org wrote: On 8/3/09 6:28 PM, Remember the dot wrote: On Mon, Aug 3, 2009 at 2:16 PM, Brion Vibberbr...@wikimedia.org wrote: Once we have a cleaner interface for hitting the general pages (without the 'secure.wikimedia.org' crappy single host) I'm curious...what will this cleaner interface look like? Will we be able to connect securely through https://en.wikipedia.org/? That's the idea... This means we need SSL proxies available on all of our front-end proxies instead of just on a dedicated location, and some hoop-jumping to get certificate hostnames to match, but it's not impossible. We did a little experimentation in '07 along these lines but just got busy with other things. :( A useful data point is that greenrea...@wikifur has switched to using protocol relative URLs rather than absolutes (i.e. //host.domain.com/foo/bar) and had good luck with it. This is an additional data-point beyond the testing I did with en.wp last year. (Last year while doing some ipv6 testing I also tested protocol relatives and determined that all the clients with JS support were unharmed by protocol relatives). Ironically— the existence of secure.wikimedia.org with insecure images is the only obstruction I see to switching images on the production sites to protocol relatives in order to confirm client compatibility. (For those following at home: If Wikimedia can use protocol relatives as a global replacement for absolutes to its own domains we can avoid inadvertent secure/insecure mode switching and leaks without having to have two copies of the article cache data and without kludgy on-the-fly rewriting) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
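For anyone who hasn't seen protocol relative URLs before, the behaviour being relied on is just this (quick Python illustration; the hostnames are only examples):

from urllib.parse import urljoin

# a // URL inherits the scheme of the page that references it, so one
# cached copy of the HTML works for both http and https readers:
print(urljoin("http://en.wikipedia.org/wiki/Example",
              "//upload.wikimedia.org/foo.png"))
# http://upload.wikimedia.org/foo.png
print(urljoin("https://en.wikipedia.org/wiki/Example",
              "//upload.wikimedia.org/foo.png"))
# https://upload.wikimedia.org/foo.png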
Re: [Wikitech-l] w...@home Extension
On Mon, Aug 3, 2009 at 10:56 PM, Michael Dale md...@wikimedia.org wrote: Also will hack in adding derivatives to the job queue where oggHandler is embed in a wiki-article at a substantial lower resolution than the source version. Will have it send the high res version until the derivative is created then purge the pages to point to the new location. Will try and have the download link still point to the high res version. (we will only create one or two derivatives... also we should decide if we want an ultra low bitrate (200kbs or so version for people accessing Wikimedia on slow / developing country connections) [snip]
So I think there should generally be three versions, a 'very low rate' suitable for streaming for people without excellent broadband, a high rate suitable for streaming on good broadband, and a 'download' copy at full resolution and very high rate. (The download copy would be the file uploaded by the user if they uploaded an Ogg) As a matter of principle we should try to achieve both very high quality and something that works for as many people as possible. I don't think we need to achieve both with one file, so the high and low rate files could specialize in those areas. The 'suitable for streaming' versions should have a limited instantaneous bitrate (non-infinite buf-delay). This sucks for quality but it's needed if we want streams that don't stall, because video can easily have 50:1 peak to average rates over fairly short time-spans. (It's also part of the secret sauce that differentiates smoothly working video from stuff that only works on uber-broadband). Based on 'what other people do' I'd say the low should be in the 200kbit-300kbit/sec range. Perhaps taking the high up to a megabit? There are also a lot of very short videos on Wikipedia where the whole thing could reasonably be buffered prior to playback. Something I don't have an answer for is what resolutions to use. The low should fit on mobile device screens. Normally I'd suggest setting the size based on the content: Low motion detail oriented video should get higher resolutions than high motion scenes without important details. Doubling the number of derivatives in order to have a large and small setting on a per article basis is probably not acceptable. :( For example— for this (http://people.xiph.org/~greg/video/linux_conf_au_CELT_2.ogv) low motion video 150kbit/sec results in perfectly acceptable quality at a fairly high resolution, while this (http://people.xiph.org/~greg/video/crew_cif_150.ogv) high motion clip looks like complete crap at 150kbit/sec even though it has 25% fewer pixels. For that target rate the second clip is much more useful when downsampled: http://people.xiph.org/~greg/video/crew_128_150.ogv yet if the first video were downsampled like that it would be totally useless as you couldn't read any of the slides. I have no clue how to solve this. I don't think the correct behavior could be automatically detected and if we tried we'd just piss off the users. As an aside— downsampled video needs some makeup sharpening, just as downsampled stills do. I'll work on getting something in ffmpeg2theora to do this. There is also the option of decimating the frame-rate. Going from 30fps to 15fps can make a decent improvement for bitrate vs visual quality but it can make some kinds of video look jerky. 
(Dropping the frame rate would also be helpful for any CPU starved devices) Something to think of when designing this is that it would be really good to keep track of the encoder version and settings used to produce each derivative, so that files can be regenerated when the preferred settings change or the encoder is improved. It would also make it possible to do quick one-pass transcodes for the rate controlled streams and have the transcoders go back during idle time and produce better two-pass encodes. This brings me to an interesting point about instant gratification: Ogg was intended from day one to be a streaming format. This has pluses and minuses, but one thing we should take advantage of is that it's completely valid and well supported by most software to start playing a file *as soon* as the encoder has started writing it. (If software can't handle this it also can't handle icecast streams). This means that so long as the transcode process is at least realtime the transcodes could be immediately available. This would, however, require that the derivative(s) be written to an accessible location. (and you will likely have to arrange so that a content-length: is not sent for the incomplete file). ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
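The shape of what I'm imagining for the derivative settings plus the encoder stamp is roughly this (sketch only; every number is a placeholder pulled from the discussion above, and the widths in particular are guesses, not decisions):

ENCODER_VERSION = "ffmpeg2theora-svn"   # placeholder; stamp whatever was used

DERIVATIVE_PROFILES = {
    # rate-controlled, streamable versions (finite buf-delay);
    # the download copy is just the original upload, untranscoded
    "low":  {"max_width": 320, "video_kbps": 250,  "fps": 15,
             "keyframes": 64, "buf_delay": 256},
    "high": {"max_width": 640, "video_kbps": 1000, "fps": 30,
             "keyframes": 64, "buf_delay": 256},
}

def stale(stored_version, stored_settings, profile):
    # a derivative should be requeued if the encoder or the preferred
    # settings changed since it was produced
    return (stored_version != ENCODER_VERSION or
            stored_settings != DERIVATIVE_PROFILES[profile])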
Re: [Wikitech-l] GIF thumbnailing
On Sun, Aug 2, 2009 at 10:26 AM, Ilmari Karonen nos...@vyznev.net wrote: [snip] It seems to me that delivering *static* thumbnails of GIF images, either in GIF or PNG format, would be a considerable improvement over the current situation. And indeed, the code to do that seems to be already in place: just set $wgMaxAnimatedGifArea = 0;
So— separate from animation why would you use a GIF rather than a PNG? I can think of two reasons: (1) you're making a spacer image and the gif is actually smaller, scaling isn't relevant here (2) you're using gif transparency and are obsessed with compatibility with old IE. Scaling doesn't tend to work really well with binary transparency. In other cases the gif tends to be larger, loads slower, etc. They can be converted to PNG losslessly, so you should probably do so. What am I missing? ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
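(The lossless conversion really is a one-liner, e.g. with the Python Imaging Library; sketch only, file names made up:)

from PIL import Image

# a static single-frame GIF is palette-based, and PNG supports everything
# GIF does, so re-saving keeps the same pixels and palette transparency
Image.open("spacer.gif").save("spacer.png")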
Re: [Wikitech-l] Wiki at Home Extension
On Sun, Aug 2, 2009 at 6:29 PM, Michael Dale md...@wikimedia.org wrote: [snip] two quick points. 1) you don't have to re-upload the whole video just the sha1 or some sort of hash of the assigned chunk.
But each re-encoder must download the source material. I agree that uploads aren't much of an issue.
[snip] other random clients that are encoding other pieces would make abuse very difficult... at the cost of a few small http requests after the encode is done, and at a cost of slightly more CPU cycles of the computing pool.
Is 2x slightly? (Greater because some clients will abort/fail.) Even that leaves open the risk that a single trouble maker will register a few accounts and confirm their own blocks. You can fight that too— but it's an arms race with no end. I have no doubt that the problem can be made tolerably rare— but at what cost? I don't think it's all that acceptable to significantly increase the resources used for the operation of the site just for the sake of pushing the capital and energy costs onto third parties, especially when it appears that the cost to Wikimedia will not decrease (but instead be shifted from equipment cost to bandwidth and developer time).
[snip] We need to start exploring the bittorrent integration anyway to distribute the bandwidth cost on the distribution side. So this work would lead us in a good direction as well. http://lists.wikimedia.org/pipermail/wikitech-l/2009-April/042656.html
I'm troubled that Wikimedia is suddenly so interested in all these cost externalizations which will dramatically increase the total cost but push those costs off onto (sometimes unwilling) third parties. Tech spending by the Wikimedia Foundation is a fairly small portion of the budget, enough that it has drawn some criticism. Behaving in the most efficient manner is laudable and the WMF has done excellently on this front in the past. Behaving in an inefficient manner in order to externalize costs is, in my view, deplorable and something which should be avoided. Has some organizational problem arisen within Wikimedia which has made it unreasonably difficult to obtain computing resources, but easy to burn bandwidth and development time? I'm struggling to understand why development-intensive externalization measures are being regarded as first choice solutions, and invented ahead of the production deployment of basic functionality. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] w...@home Extension
On Sat, Aug 1, 2009 at 12:13 AM, Michael Dale md...@wikimedia.org wrote: true... people will never upload to site without instant gratification ( cough youtube cough ) ...
Hm? I just tried uploading to youtube and there was a video up right away. Other sizes followed within a minute or two.
At any rate its not replacing the firefogg that has instant gratification at point of upload its ~just another option~...
As another option— Okay. But video support on the site stinks because of lack of server side 'thumbnailing' for video. People upload multi-megabit videos, which is a good thing for editing, but then they don't play well for most users. Just doing it locally is hard— we've had failed SOC projects for this— doing it distributed has all the local complexity and then some.
Also I should add that this w...@home system just gives us distributed transcoding as a bonus side effect ... its real purpose will be to distribute the flattening of edited sequences. So that 1) IE users can view them 2) We can use effects that for the time being are too computationally expensive to render out in real-time in javascript 3) you can download and play the sequences with normal video players and 4) we can transclude sequences and use templates with changes propagating to flattened versions rendered on the w...@home distributed computer
I'm confused as to why this isn't being done locally at Wikimedia. Creating some whole distributed thing seems to be trading off something inexpensive (machine cycles) for something there is less supply of— skilled developer time. Processing power is really inexpensive. Some old copy of ffmpeg2theora on a single core of my core2 desktop processes a 352x288 input video at around 100mbit/sec (input video consumption rate). Surely the time and cost required to send a bunch of source material to remote hosts is going to offset whatever benefit this offers. We're also creating a whole additional layer of cost in that someone has to police the results. Perhaps my Tyler Durden reference was too indirect:
* Create a new account
* splice some penises 30 minutes into some talking head video
* extreme lulz.
Tracking down these instances and blocking these users seems like it would be a full-time job for a couple of people and it would only be made worse if the naughtiness could be targeted at particular resolutions or fallbacks. (Making it less likely that clueful people will see the vandalism)
While presently many machines in the wikimedia internal server cluster grind away at parsing and rendering html from wiki-text the situation is many orders of magnitude more costly with using transclusion and templates with video ... so its good to get this type of extension out in the wild and warmed up for the near future ;)
In terms of work per byte of input the wikitext parser is thousands of times slower than the theora encoder. Go go inefficient software. As a result the difference may be less than many would assume. Once you factor in the ratio of video to non-video content for the foreseeable future this comes off looking like a time wasting boondoggle. Unless the basic functionality— like downsampled videos that people can actually play— is created I can't see there ever being a time where some great distributed thing will do any good at all.
The segmenting is going to significantly harm compression efficiency for any inter-frame coded output format unless you perform a two pass encode with the first pass on the server to do keyframe location detection. 
Because the stream will restart at cut points.
also true. Good thing theora-svn now supports two pass encoding :) ...
Yea, great, except doing the first pass for segmentation is pretty similar in computational cost to simply doing a one-pass encode of the video.
but an extra key frame every 30 seconds probably won't hurt your compression efficiency too much..
It's not just about keyframe locations— if you encode separately and then merge you lose the ability to provide continuous rate control. So there would be large bitrate spikes at the splice intervals which will stall streaming for anyone without significantly more bandwidth than the clip.
vs the gain of having your hour long interview trans-code a hundred times faster than non-distributed conversion. (almost instant gratification)
Well tuned you can expect a distributed system to improve throughput at the expense of latency. Sending out source material to a bunch of places, having them crunch on it on whatever slow hardware they have, then sending it back may win on the dollars per throughput front, but I can't see that having good latency.
true... You also have to log in to upload to commons
It will make life easier and make abuse of the system more difficult.. plus it can
Having to create an account does pretty much nothing to discourage malicious activity.
act as a motivation factor with distribu...@home teams, personal stats and
Re: [Wikitech-l] w...@home Extension
On Sat, Aug 1, 2009 at 2:54 AM, Brian brian.min...@colorado.edu wrote: On Sat, Aug 1, 2009 at 12:47 AM, Gregory Maxwell gmaxw...@gmail.com wrote: On Sat, Aug 1, 2009 at 12:13 AM, Michael Dale md...@wikimedia.org wrote: Once you factor in the ratio of video to non-video content for the foreseeable future this comes off looking like a time wasting boondoggle. I think you vastly underestimate the amount of video that will be uploaded. Michael is right in thinking big and thinking distributed. CPU cycles are not *that* cheap.
Really rough back of the napkin numbers: My desktop has a X3360 CPU. You can build systems all day using this processor for $600 (I think I spent $500 on it 6 months ago). There are processors with better price/performance available now, but I can benchmark on this. Commons is getting roughly 172076 uploads per month now across all media types. Scans of single pages, photographs copied from flickr, audio pronunciations, videos, etc. If everyone switched to uploading 15 minute long SD videos instead of other things there would be 154,868,400 seconds of video uploaded to commons per month. Truly a staggering amount. Assuming a 40 hour work week it would take over 250 people working full time just to *view* all of it. That number is an average rate of 58.9 seconds of video uploaded per second, every second of the month. Using all four cores my desktop encodes video at 16x real-time (for moderate motion standard def input using the latest theora 1.1 svn). So you'd need less than four of those systems to keep up with the entire commons upload rate switched to 15 minute videos. Okay, it would be slow at peak hours and you might wish to produce a couple of versions at different resolutions, so multiply that by a couple. This is what I meant by processing being cheap. If the uploads were all compressed at a bitrate of 4mbit/sec and users were kind enough to spread their uploads out through the day and the distributed system were perfectly efficient (only need to send one copy of the upload out), and if Wikimedia were only paying $10/mbit/sec/month for transit out of their primary datacenter... we'd find that the bandwidth costs of sending that source material out again would be $2356/month. (58.9 seconds per second * 4mbit/sec * $10/mbit/sec/month) (Since transit billing is on the 95th percentile 5 minute average of the greater of inbound or outbound, uploads are basically free, but sending out data to the 'cloud' costs like anything else). So under these assumptions sending out compressed video for re-encoding is likely to cost roughly as much *each month* as the hardware for local transcoding. ... and the pace of processing speed up seems to be significantly better than the declining prices for bandwidth. This is also what I meant by processing being cheap. Because uploads won't be uniformly spaced you'll need some extra resources to keep things from getting bogged at peak hours. But the poor peak-to-average ratio also works against the bandwidth costs. You can't win: Unless you assume that uploads are going to be very low bitrates local transcoding will always be cheaper with very short payoff times. I don't know how to figure out how much it would 'cost' to have human contributors spot embedded penises snuck into transcodes and then figure out which of several contributing transcoders are doing it and blocking them, only to have the bad user switch IPs and begin again. ... but it seems impossibly expensive even though it's not an actual dollar cost. 
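The same napkin math in executable form, in case anyone wants to poke at the assumptions (all inputs are the figures stated above):

uploads_per_month = 172076                        # current Commons uploads/month
video_seconds     = uploads_per_month * 15 * 60   # if every upload were 15 min of video
month_seconds     = 365.25 / 12 * 86400           # an average month

upload_rate = video_seconds / month_seconds       # ~58.9 sec of video per second

# local transcoding: a ~$600 quad-core box encodes ~16x realtime
boxes_needed = upload_rate / 16.0                  # ~3.7 machines, one-off cost

# distributed transcoding: ship 4 mbit/sec source out again at $10/mbit/sec/month
bandwidth_cost = upload_rate * 4.0 * 10.0          # ~$2356, every month

print(round(upload_rate, 1), round(boxes_needed, 1), round(bandwidth_cost))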
There is a lot of free video out there and as soon as we have a stable system in place wikimedians are going to have a heyday uploading it to Commons.
I'm not saying that there won't be video; I'm saying there won't be video if development time is spent on fanciful features rather than desperately needed short term functionality. We have tens of thousands of videos, many of which don't stream well for most people because they need thumbnailing. Firefogg was useful upload lubrication. But user-powered cloud transcoding? I believe the analysis I provided above demonstrates that resources would be better applied elsewhere. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] w...@home Extension
On Sat, Aug 1, 2009 at 12:17 PM, Brian brian.min...@colorado.edu wrote: A reasonable estimate would require knowledge of how much free video can be automatically acquired, it's metadata automatically parsed and then automatically uploaded to commons. I am aware of some massive archives of free content video. Current estimates based on images do not necessarily apply to video, especially as we are just entering a video-aware era of the internet. At any rate, while Gerard's estimate is a bit optimistic in my view, it seems realistic for the near term.
So— The plan is that we'll lose money on every transaction but we'll make it up in volume? (Again, this time without math: As a function of video-minutes, the amortized hardware cost of local transcoding grows more slowly than the bandwidth cost of sending the source material off to users to transcode in a distributed manner. This holds for pretty much any reasonable source bitrate, though I used 4mbit/sec in my calculation. So regardless of the amount of video being uploaded, using users is simply more expensive than doing it locally.) Existing distributed computing projects work because the ratio of CPU-crunching to communicating is enormously high. This isn't (and shouldn't be) true for video transcoding. They also work because there is little reward for tampering with the system. I don't think this is true for our transcoding. There are many who would be greatly gratified by splicing penises into streams, far more so than anonymously and undetectably making a protein fold wrong. ... and it's only reasonable to expect the cost gap to widen.
On Sat, Aug 1, 2009 at 9:57 AM, David Gerard dger...@gmail.com wrote: Oh hell yes. If I could just upload any AVI or MPEG4 straight off a camera, you bet I would. Just imagine what people who've never heard the word Theora will do.
Sweet! Except, *instead* of developing the ability to upload straight off a camera what is being developed is user-distributed video transcoding— which won't do anything itself to make it easier to upload. What it will do is waste precious development cycles maintaining an overly complicated software infrastructure, waste precious commons administration cycles hunting subtle and confusing sources of vandalism, and waste income from donors by spending more on additional outbound bandwidth than would be spent on computing resources to transcode locally. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] w...@home Extension
On Sat, Aug 1, 2009 at 1:13 PM, Brianbrian.min...@colorado.edu wrote: There are always tradeoffs. If I understand w...@home correctly it is also intended to be run @foundation. It works just as well for distributing transcoding over the foundation cluster as it does for distributing it to disparate clients. There is nothing in the source code that suggests that. It currently requires the compute nodes to be running the firefogg browser extension. So this would require loading an xserver and firefox onto the servers in order to have them participate as it is now. The video data has to take a round-trip through PHP and the upload interface which doesn't really make any sense, that alone could well take as much time as the actual transcode. As a server distribution infrastructure it would be an inefficient one. Much of the code in the extension appears to be there to handle issues that simply wouldn't exist in the local transcoding case. I would have no objection to a transcoding system designed for local operation with some consideration made for adding externally distributed operation in the future if it ever made sense. Incidentally— The slice and recombine approach using oggCat in WikiAtHome produces files with gaps in the granpos numbering and audio desync for me. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Alternative editing interfaces using write API (was: Re: Watchlistr.com, an outside site that asks for Wikimedia passwords)
On Wed, Jul 22, 2009 at 10:05 PM, Brianna Laugherbrianna.laug...@gmail.com wrote: [snip] I can imagine someone building an alternative edit interface for a subset of Wikipedia content, say a WikiProject. Then the interface can strip away all the general crud and just provide information relevant to that topic area. Sweet. I look forward to the bright future where I can create an enhanced AJAX edit-box for MediaWiki then throw it up with a bunch of ads and private-data-collection and avoid the pesky problem of open sourcing my code and contributing it back to the MediaWiki codebase in order to get it widely used. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Do no harm
On Thu, Jul 23, 2009 at 11:07 AM, dan nessettdness...@yahoo.com wrote: [snip] On the other hand, if there were regression tests for the main code and for the most important extensions, I could make the change, run the regression tests and see if any break. If some do, I could focus my attention on those problems. I would not have to find every place the global is referenced and see if the change adversely affects the logic. This only holds if the regression test would fail as a result of the change. This is far from a given for many changes and many common tests. Not to mention the practical complications— many extensions have complicated configuration and/or external dependencies. make test_all_extensions is not especially realistic. Automated tests are good, necessary even, but they don't relieve you of the burden of directly evaluating the impact of a broad change. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Clickjacking and CSRF
On Wed, Jul 22, 2009 at 12:54 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote: Well, in this case we're not even talking about something that would go into HTML 5, necessarily, it's being developed by only Mozilla right now. If more important Wikimedia people than I state agreement with me about the importance of the feature to easy CSP deployment, I think that will be more useful than flaming anyone. Or if they disagree, they should say so so I don't mislead the Mozilla people into thinking the feature needs to be added to the spec. [snip]
This point is worth saying twice. If some minor tweak (like a monitor-but-not-enforce mode) is necessary and sufficient for the Mediawiki core devs to commit to using the feature (and for Wikimedia to roll it out on Wikipedia) then that should carry significant weight for both the implementors and whatwg as a whole. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Watchlistr.com, an outside site that asks for Wikimedia passwords
On Wed, Jul 22, 2009 at 4:18 PM, David Gerarddger...@gmail.com wrote: Mmm. So solving this properly would require solving many of the various consolidated/multiple watchlist bugs in MediaWiki itself, then. Hm? No. Solving *this* involves having a sysadmin determine the source IP of the remote logins and scrambling the password of every account which has logged in through it. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] just a note...
On Sat, Jul 11, 2009 at 6:13 PM, Domas Mituzasmidom.li...@gmail.com wrote: Could you elaborate on what template and why changing a single template should have that large an effect? tomorrow =) I'm guessing something that added some categories to some very widely used infobox or licensing templates. Do I win? ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Recommending a browser for video (was: Proposal: switch to HTML 5)
On Thu, Jul 9, 2009 at 5:23 PM, David Gerarddger...@gmail.com wrote: 2009/7/9 Platonides platoni...@gmail.com: I advocate simply: You can [[install X]] to get native support. [[More info]] What do we do for iPhone users? They do not have Theora support because Apple has actively decided it will not support it; we can either appear to be defective, or we can correctly assign responsibility. I assume Apple is not ashamed of their decision to exclude Theora. Obviously the solution is to send the user to instructions on how to jailbreak their iPhone and install Theora support. Duh. ;) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Recommending a browser for video (was: Proposal: switch to HTML 5)
On Thu, Jul 9, 2009 at 6:20 PM, David Gerarddger...@gmail.com wrote: 2009/7/9 Aryeh Gregor simetrical+wikil...@gmail.com: Assuming that native support really is noticeably better. Maybe we could only suggest it if we detect that the playback is stuttering, or suggest it more prominently if we detect that. I assume Cortado can detect that. Are there noticeable advantages to native playback other than better performance? Yes: not waiting thirty seconds for Java to start up. Ten of which your browser spends appearing to have crashed, in many cases. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] secure slower and slower
On Wed, Jul 8, 2009 at 9:05 AM, David Gerarddger...@gmail.com wrote: 2009/7/7 Aryeh Gregor simetrical+wikil...@gmail.com: But really -- have there been *any* confirmed incidents of MITMing an Internet connection in, say, the past decade? Real malicious attacks in the wild, not proof-of-concepts or white-hat experimentation? I'd imagine so, but for all people emphasize SSL, I can't think of any specific case I've heard of, ever. It's not something normal people need to worry much about, least of all for Wikipedia. Nope. The SSL threat model is completely arse-backwards. It assumes secure endpoints and a vulnerable network. Whereas what we see in practice is Trojaned endpoints and no-one much bothering with the network. Actually, there is a lot of screwing with the network. For instance, take the UK service providers surreptitiously modifying Wikipedia's responses on the fly to create a fake 404 when you hit particular articles. I believe it's a common practice for US service providers to sell information feeds about users' browsing data (I say believe because I know it's done, but don't have concrete information about how common it is). Your use of Wikipedia likely has less privacy than your use of a public library. SSL kills these attacks dead. People who try to read via Tor to avoid the above mentioned problems subject themselves to naughty activities by unscrupulous exit operators. MITM activities by Tor exit operators are common and well documented. SSL would remove some of the incentive to use Tor (since your local network/ISP could no longer spy on you if you used SSL) and would remove most of Tor's grievous hazard for those who continue to use it to read. There are some truly nasty things you can do with an enwiki admin account. They can be undone, sure, but a lot of damage can be done. They are obvious enough, and have been discussed in backrooms enough, that I don't think I'll do much harm by listing a few of them: (1) By twiddling site JS you can likely knock any site off the internet by scripting clients to connect to the target site frequently. Although this could be deactivated once it was discovered, due to caching it would hang around for a while. Well timed, even a short outage could cause significant real damage in dollar terms. (2) You could script clients to kick users to a malware installer. Again, it could be quickly undone, but a lot of damage could be caused with only a few minutes of script placement. Generally you could use WP as a nice launching ground for any kind of XSS vulnerability that you're already aware of. Any of these JS attacks could be enhanced by only making them effective for anons, reducing their visibility, and by making the JS modify the display of the MediaWiki: pages to both hide the bad JS from users and make it impossible to remove without disabling client JS. Provided your changes didn't break the site, I'd take a bet that you could have a malware installer running for days before it was discovered. (3) You could rapidly merge page histories for large numbers of articles, converting their histories into jumbled messes. I don't believe we yet have any automated solution to fix that beyond restoring the site from backups. (4) Any admin account can be used to capture bureaucrat and/or checkuser access by injecting user JS for one of these users and using it to steal their session cookie (unless the change to SUL stopped this, but I don't see how it could have; even if so you could remote-pilot them). 
With checkuser access you can quickly dump out decent amounts of private data. The leak of private data can never be undone. (Or, alternatively, you can just MITM a real steward, checkuser, or bureaucrat (say, at Wikimania or a wiki meetup :) ) and get their access directly.) These are just a few things… I'm sure if you think creatively you can come up with more. The use of SSL makes attacks harder and some types of attack effectively impossible. It should be considered important. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Proposal: switch to HTML 5
On Wed, Jul 8, 2009 at 2:23 PM, Michael Dalemd...@wikimedia.org wrote: The current language is For best video playback experience we recommend _Firefox 3.5_ ... but I am open to adjustments. I'd drop the word experience. It's superfluous marketing speak. So the notice chain I'm planning on adding to the simple video compatibility JS is something like this: If the user is using Safari 4 on a desktop system and doesn't have XiphQT: * Advise the user to install XiphQT (note, there should be a good installer available soon). The rationale being that if they are known to use Safari now they probably will in the future; better to get them to install XiphQT than to hope they'll continue using another browser. If the user is using any of a list of platforms known to support Firefox: * Advise them to use Firefox 3.5. Otherwise say nothing. It would be silly at this time to be advising users of some non-Firefox-supporting mobile device that Firefox 3.5 provides the best experience. ;) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
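[To make the notice chain above concrete, here is a rough sketch in plain JavaScript. The user-agent sniffing is deliberately simplified and showNotice() is a hypothetical UI helper; this is not the actual compatibility JS, just an illustration of the decision order described in the message.]

  function adviseVideoUpgrade() {
    var v = document.createElement('video');
    // If the browser already reports native Ogg support, say nothing.
    if (v.canPlayType && v.canPlayType('video/ogg') !== '') {
      return;
    }
    var ua = navigator.userAgent;
    // Desktop Safari (Chrome also says "Safari", so exclude it).
    var isSafariDesktop = ua.indexOf('Safari') !== -1 &&
                          ua.indexOf('Chrome') === -1 &&
                          ua.indexOf('Mobile') === -1;
    // Platforms on which Firefox 3.5 is known to be available.
    var knownFirefoxPlatform = /Windows|Macintosh|X11|Linux/.test(ua);
    if (isSafariDesktop) {
      // Known Safari user: better to get XiphQT installed than to hope
      // they switch browsers.
      showNotice('Install the XiphQT QuickTime component for native playback.');
    } else if (knownFirefoxPlatform) {
      showNotice('Firefox 3.5 plays this video natively.');
    }
    // Otherwise: say nothing (e.g. mobile devices Firefox does not support).
  }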
Re: [Wikitech-l] Recommending a browser for video (was: Proposal: switch to HTML 5)
On Wed, Jul 8, 2009 at 2:56 PM, Aryeh Gregorsimetrical+wikil...@gmail.com wrote: On Wed, Jul 8, 2009 at 2:43 AM, Marco Schusterma...@harddisk.is-a-geek.org wrote: We should not recommend Chrome - as good as it is, but it has serious privacy problems. Opera is not Open Source, so I think we'd best stay with Firefox, even if Chrome/Opera begin to support video tag. I don't think we should use these kinds of ideological criteria when making any sort of recommendation here. We should state in a purely neutral fashion that browsers X, Y, and Z will result in the video playing better on your computer than your current browser does. It would be misleading to imply that Firefox is superior to these other browsers for the purposes of playing the video tag. Not every decision is a purely technical one. Mozilla has done a lot to support the development of this functionality. Putting other browser developers on equal footing is not a neutral decision either. The ideological, and other, criteria are moot when there is only one thing to recommend. On Wed, Jul 8, 2009 at 2:42 PM, Gregory Maxwellgmaxw...@gmail.com wrote: That sounds good. Why not recommend Safari plus XiphQT as well, if the goal is only to tell them what browsers support good video playback? Hm. Two things to install rather than one? For the moment there is also a technical problem with Safari 4: It claims (via the canPlayType() call) that it can't support Ogg even when XiphQT is installed. We currently work around this by detecting the mime-type registration which happens as part of the XiphQT installation. In practice this means that Safari 4 will work with Ogg video on sites using OggHandler, but not on many others. Safari also isn't an especially widely adopted browser outside of Apple systems. Should we also recommend the dozens of oddball free Gecko- and WebKit-based browsers supporting the video tag which are soon to exist? Flooding the users with options is a good way to turn them off. There is already at least one (Midori). ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Proposal: switch to HTML 5
On Wed, Jul 8, 2009 at 3:06 PM, David Gerarddger...@gmail.com wrote: 2009/7/8 j...@v2v.cc: David Gerard wrote: You are using Internet Explorer. Install the Ogg codecs _here_ for a greatly improved Wikimedia experience. Internet Explorer does not support the video tag, installing Ogg DirectShow filters does not help there. Yes, I realised this just after sending my email :-) I presume, though, there's some way of playing videos in IE. Is there a way to tell if the Ogg filters are installed? Java or via the VLC plugin At least the Safari + XiphQT combination has the benefit of working as well as Firefox 3.5 does. The same is not true for Java or VLC. (The VLC plugin is reported to cause many browser crashes; Java is slow to launch and somewhat CPU hungry.) I've suggested making the same installer for XiphQT for win32 also install the XiphDS plugins, which would make things easier on users. But XiphDS does not help with in-browser playback today. Since, at the moment, Firefox is the only non-beta browser with direct support I don't see why plugging Firefox would be controversial. It's a matter of fact that it works best with Firefox 3.5 or Safari+XiphQT. Later when there are several options things will be a little more complicated. Certainly I don't think any recommendation should be made when the user already has native-grade playback. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Recommending a browser for video (was: Proposal: switch to HTML 5)
On Wed, Jul 8, 2009 at 6:12 PM, David Gerarddger...@gmail.com wrote: 2009/7/8 Aryeh Gregor simetrical+wikil...@gmail.com: On Wed, Jul 8, 2009 at 4:27 PM, David Gerarddger...@gmail.com wrote: Uh, it's not a good option for Wikimedia video. With XiphQT, why not? Maybe not ideal, but surely good. As Greg has noted, due to a bug in Safari it's impossible for the browser at present to indicate that it can handle Ogg or not. So how do we tell if the Safari user can use that or if they have to download XiphQT? There isn't a way at present. Either we shove Safari on Mac users onto Cortado by default (since Java can be presumed present on MacOS X) or we risk giving them a video element that doesn't work. (Unless the failure can somehow be sniffed.) Well *we* do. As a side effect of installing XiphQT a mime type is registered. This is completely independent of the video tag. So we'll detect this and use it anyway. I believe we're the only users of video who have ever done this. It's not obvious, and I doubt we'd be doing it were it not for the fact that that detection method was previously used for detecting pre-video availability of XiphQT. (FWIW, that behaviour is now fixed in their development builds.) Regardless, I think we've finished the technical part of this decision— the details are a matter of organizational concern now, not technology. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
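[A minimal sketch of the detection trick described above: trust canPlayType() where it answers honestly, and otherwise look for a mime type registered by the XiphQT installation. The exact type string probed here is an assumption for illustration, not OggHandler's actual code.]

  // Returns true if this browser can probably play Ogg video natively.
  function canPlayOggNatively() {
    var video = document.createElement('video');
    if (video.canPlayType && video.canPlayType('video/ogg') !== '') {
      return true; // the browser admits support directly
    }
    // Safari 4 answers "" even with XiphQT installed, so fall back to
    // checking the mime types registered by installed plugins/components.
    var mt = navigator.mimeTypes;
    return !!(mt && mt['application/ogg'] && mt['application/ogg'].enabledPlugin);
  }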
Re: [Wikitech-l] Proposal: switch to HTML 5
On Tue, Jul 7, 2009 at 1:54 AM, Aryeh Gregorsimetrical+wikil...@gmail.com wrote: [snip] * We could support video/audio on conformant user agents without the use of JavaScript. There's no reason we should need JS for Firefox 3.5, Chrome 3, etc. Of course, that could be done without switching the rest of the site to HTML5... Although I'm not sure that giving the actual video tags is desirable. It's a tradeoff: Work for those users when JS is disabled and correctly handle saving the full page including the videos, vs take more traffic from clients doing range requests to generate the poster image, and potentially traffic from clients which decide to go ahead and fetch the whole video regardless of the user asking for it. There is also still a bug in FF3.5 where the built-in video controls do not work when JS is fully disabled. (Because the controls are themselves written in JS.) (To be clear to other people reading this, the MediaWiki OggHandler extension already uses HTML5 and works fine with Firefox 3.5, etc. But this only works if you have JavaScript enabled. The site could instead embed the video elements directly, and only use JS to swap the video tag out for a fallback when it detects that the video tag can't be used.) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
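[A minimal sketch of the "embed the element directly, substitute a fallback only where needed" approach mentioned at the end of that message. The markup and the embedCortadoApplet() helper are hypothetical stand-ins, not the real OggHandler code.]

  // Markup served to everyone, usable with JS disabled in capable browsers:
  //   <video src="Example.ogv" poster="Example.jpg" controls></video>
  // JS then replaces the element only where the tag cannot actually play Ogg.
  function substituteFallbacks() {
    var videos = document.getElementsByTagName('video');
    // Iterate backwards: replacing a node shrinks the live collection.
    for (var i = videos.length - 1; i >= 0; i--) {
      var v = videos[i];
      if (!v.canPlayType || v.canPlayType('video/ogg') === '') {
        embedCortadoApplet(v); // hypothetical helper: swap in the Java player
      }
    }
  }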
Re: [Wikitech-l] Proposal: switch to HTML 5
On Tue, Jul 7, 2009 at 7:53 PM, Michael Dalemd...@wikimedia.org wrote: [snip] I don't really have apple machine handy to test quality of user experience in OSX safari with xiph-qt. But if that is on-par with Firefox native support we should probably link to the component install instructions for safari users. I believe it's quite good. Believe is the best I can offer, never having personally tested it. I did work with a Safari user sending them specific test cases designed to torture it hard (and some XiphQT bugs were fixed in the process) and at this point it sounds pretty good. What I have not stressed is any of the JS API. I know it seeks; I have no clue how well, etc. There is also a friendly and helpful Apple WebKit developer whom we can work with to get things fixed if we do encounter bugs... but more testing is really needed. Safari users wanted. As far as the 'soft push' ... I'm generally not a big fan of one-shot completely dismissible nags: Too often I click past something only to realize shortly thereafter that I really should have clicked on it. I'd prefer something that did a significant (alert-level) nag *once* but perpetually included a polite Upgrade your Video button below (above?) the fallback video window. There is only a short period of time remaining where a singular browser recommendation can be done fairly and neutrally. Chrome and Opera will ship production versions and then there will be options. Choices are bad for usability. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Regular expressions searching
On Mon, Jul 6, 2009 at 2:37 PM, Aryeh Gregorsimetrical+wikil...@gmail.com wrote: On Mon, Jul 6, 2009 at 7:43 AM, Andrew Garrettagarr...@wikimedia.org wrote: Yes. We wouldn't allow direct searching from the web interface with regexes for two related reasons: 1/ A single search with the most basic of regexes would take several minutes, if not hours. It isn't computationally trivial to search for a small string in a very large string of over 10 GB, let alone a regex. Words can be indexed, regexes cannot. 2/ Even if we could find a way to make the former performant, a malicious regex could significantly expand this time taken, leading to a denial of service. I seem to recall Gregory Maxwell describing a setup that made this feasible, given the appropriate amount of dedicated hardware. It was run with the entire database in memory; it only permitted real regular expressions (compilable to finite-state machines, no backreferences etc.); and it limited the length of the finite-state machine generated. Running a regex took several minutes, but he'd run a lot of them in parallel, since it was mostly memory-bound, so he got fairly good throughput. Something like that. But probably not practical without an undue amount of effort and hardware, yeah. :) Yes, I didn't comment on the initial comment because full PCRE is simply far too much to ask for. Basic regexps of the sort that can be compiled into a deterministic finite state machine (i.e. no backtracking) can be merged together into a single larger state machine. So long as the state machine fits in cache, the entire DB can be scanned in not much more time than it takes to read it in from memory, even if there are hundreds of parallel regexps. So you batch up user requests then run them in parallel groups. Good throughput, poor latency. Insufficiently selective queries are problematic. I never came up with a really good solution to people feeding in patterns like '.' and stalling the whole process by wasting a lot of memory bandwidth updating the result set. (An obvious solution might just be to limit the number of results.) The latency can be reduced by partitioning the database across multiple machines (more aggregate memory bandwidth). By doing this you could achieve arbitrarily low latency and enormous throughput. Dunno if it's actually worthwhile, however. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
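[A toy sketch of the batching idea in JavaScript, under stated assumptions: patterns are restricted to backtracking-free constructs, and results per pattern are capped so unselective patterns like '.' cannot flood the result set. A real implementation would compile the whole batch into a single merged DFA and scan the text once per batch, rather than looping over individual RegExp objects as this sketch does.]

  // Toy sketch: one pass over the corpus per batch of user patterns.
  function runBatch(patterns, pages, maxMatchesPerPattern) {
    var compiled = patterns.map(function (p) { return new RegExp(p); });
    var results = patterns.map(function () { return []; });
    pages.forEach(function (page) {
      compiled.forEach(function (re, i) {
        // Cap results so an unselective pattern cannot stall the batch
        // by endlessly growing its result set.
        if (results[i].length < maxMatchesPerPattern && re.test(page.text)) {
          results[i].push(page.title);
        }
      });
    });
    return results; // results[i] lists page titles matching patterns[i]
  }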
Re: [Wikitech-l] On templates and programming languages
On Wed, Jul 1, 2009 at 1:42 AM, Dmitriy Sintsovques...@rambler.ru wrote: XSLT itself is a way too much locked down - even simple things like substrings manipulation and loops aren't so easy to perform. Well, maybe I am too stupid for XSLT but from my experience bringing tag syntax in programming language make the code poorly readable and bloated. I've used XSLT for just one of my projects. Juniper Networks (my day job) uses XSLT as the primary scripting language on their routing devices, and chose to do so primarily because of sandboxing and the ease of XML tree manipulation with XPath (JunOS configuration has a complete and comprehensive XML representation). To facilitate that usage we defined an alternative syntax for XSLT called SLAX (http://code.google.com/p/libslax/), though it hasn't seen widespread adoption outside of Juniper yet. (SLAX can be mechanically converted to XSLT and vice versa.) SLAX pretty much resolves your readability concern, although the conceptual barriers for people coming from procedural languages to any strongly functional programming language still remain. You don't loop in XSLT, you recurse or iterate over a structure (i.e. map/reduce). I've grown rather fond of XSLT but wouldn't personally recommend it for this application. It lacks the high speed bytecoded execution environments available for other languages, and I don't see many scripts on the site doing extensive document tree manipulation (it's hard for me to express how awesome XPath is at that)... and I would also guess that there are probably more adept MediaWiki template language coders today than there are people who are really fluent in XSLT. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] On templates and programming languages
On Wed, Jul 1, 2009 at 3:50 AM, William Allen Simpsonwilliam.allen.simp...@gmail.com wrote: Javascript, OMG don't go there. Don't be so quick to dismiss JavaScript. If we were making a scorecard it would likely meet most of the checkboxes: * Availability of reliable battle-tested sandboxes (and probably the only option discussed, other than x-in-JVM, meeting this criterion) * Availability of fast execution engines * Widely known by the existing technical userbase (JS beats the other options hands down here) * Already used by many MediaWiki developers * Doesn't inflate the number of languages used in the operation of the site * Possibility of reuse between server-executed and client-executed code (only JS of the named options meets this criterion) * Can easily write clear and readable code * Modern high level language features (dynamic arrays, hash tables, etc) There may exist great reasons why another language is a better choice, but JS is far from the first thing that should be eliminated. Python is a fine language but it fails all the criteria I listed above except the last two. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] On templates and programming languages
On Wed, Jul 1, 2009 at 11:21 AM, William Allen Simpsonwilliam.allen.simp...@gmail.com wrote: * Doesn't inflate the number of languages used in the operation of the site This is the important checkbox, as far as integration with the project (my first criterion), but is the server side code already running JavaScript? For serving pages? No, but MediaWiki and the sites are already chock-full of client side code in JS. You basically can't do advanced development for MediaWiki or the Wikimedia sites without a degree of familiarity with JavaScript due to client compatibility considerations. My general rule: coming over the network, presume it's bad data. In this case we're not talking about the language MediaWiki is written in; we're talking about a language used for server-side content automation (templates). In that case we'd be assuming the inputs are toxic just like in the client side case, since everything, including the code itself, came in over the network. I'll concede that there likely wouldn't be much code reuse, but I'd attribute that more to the starkly different purpose and the fact that the server version would have a different API (no DOM, but instead functions for pulling data out of MediaWiki). And we have far too many examples of existing JS already being used in horrid templates, being promulgated in important areas such as large categories, that don't seem to work consistently, and don't work at all with JavaScript turned off. I run Firefox with JS off by default for all wikimedia sites, because of serious problems in the not so recent past! Fortunately this is a non-issue here: Better server side scripting enhances the site's ability to operate without requiring scripting on the client. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] Technical solution to the privileged users adding web bugs problem
Shutting Down XSS with Content Security Policy http://blog.mozilla.com/security/2009/06/19/shutting-down-xss-with-content-security-policy/ I'm usually the first to complain about applying technical solutions to problems which are not fundamentally technical... but this looks like it would be reasonably expedient to implement. While it won't be effective for all users, the detection functionality would be a big improvement in wrangling these problems across the hundreds of Wikimedia projects, many of which lack reasonable oversight of their sysop activities. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Technical solution to the privileged users adding web bugs problem
On Mon, Jun 29, 2009 at 7:56 PM, Aryeh Gregorsimetrical+wikil...@gmail.com wrote: I think this would be reasonable to consider implementing as soon as we have a significant number of users using it. It isn't a good idea to make CSP policies that won't actually be effective immediately for a lot of people, because then we'll probably use it incorrectly, break tons of stuff, and not even notice for months or years (possibly even harming uptake of the first version of Firefox to support it). This does seem to be Mozilla-only, though. If it were an open specification that multiple vendors were committed to implementing, that would make it significantly more attractive. I wonder why Mozilla isn't proposing this through the W3C from the get-go. When to do it is a philosophical issue: Arguably it should be turned on early so that the early adopters of the technology (i.e. Firefox devs!) will be test subjects. Support for the audio/video tag in Wikipedia has been helpful in the development of Firefox audio and video tag support. If the feature is turned on only once these clients are widely deployed then we'll have a situation where things may be broken for many users. So— turn it on early and have many things broken for a small number of technically savvy users, potentially even slowing the adoption of a future browser release... or turn it on later, when it will likely cause a few problems but for 30% of the site's visitors? The latter sounds like too much of a flag-day. The stuff likely to stay broken after the initial implementation is things like user scripts. Those are just going to take a long time to fix no matter what. The best thing there would be to communicate the correct practices well in advance so that the natural development cycle picks them up, but I'm not aware of any way to communicate such a thing except by making the wrong ways not work. We'd have to do some work to get full benefit from this, since we currently use stuff like inline script all over the place. Right, though with all the minification interest I've seen here lately it sounds like a great time to hoist all that stuff out of the pages. But it would be fairly trivial to use only *-src to deny any remote loading of content from non-approved domains, and skip the rest. That would at least mitigate XSS some, but it would stop the privacy issues we've been having cold, as you say. I think one really compelling thing about it is that supporting clients can provide feedback to the webserver. This means that every supporting user will be an XSS test probe, a canary in the page-mine. So even if this doesn't become standardized and widely adopted by clients other than Firefox, it would reduce the damage of unintentional but well meaning privacy leaks, since we'd get notice of them very quickly rather than months later. Hopefully this will be more widely adopted, because I think that the available knobs provide a level of functionality which we couldn't achieve any other way. (i.e. we could deny html/script injection completely in MediaWiki, but limiting scripts to accessing particular domains isn't something MediaWiki could reasonably do itself.) I don't know enough to comment on the W3C path— but I have no particular reason to think it wouldn't happen: W3C activity is almost universally lagging rather than leading. Things like this aren't generally matters for discussion unless someone is thinking of implementing them. 
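[A minimal sketch of the "canary" feedback loop described above, in Node-style JavaScript: supporting clients POST a JSON violation report to a report URI, and simply logging those reports gives early notice of injected or policy-breaking content. The endpoint path and port are made up for illustration; the report format itself is defined by the CSP proposal, so the sketch just logs the raw body rather than assuming field names.]

  var http = require('http');

  // Minimal violation-report collector: log every report the browsers send.
  http.createServer(function (req, res) {
    if (req.method === 'POST' && req.url === '/csp-report') {
      var body = '';
      req.on('data', function (chunk) { body += chunk; });
      req.on('end', function () {
        console.log('CSP violation report: ' + body); // raw JSON from the browser
        res.writeHead(204);
        res.end();
      });
    } else {
      res.writeHead(404);
      res.end();
    }
  }).listen(8080);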
___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Mediawiki and html5
On Sat, Jun 27, 2009 at 4:39 AM, Strainustrain...@gmail.com wrote: Hi, I've heard that wikipedia will be among the first content providers to support the video and audio tags in html5. I'm trying to put up a presentation about the subject for a FF3.5 release party and I would like to find out more. Could you point me to some documents or answer some of the questions below? 1) When will this support appear? 2) Has the code already been modified accordingly? Stephen Bain addressed this admirably, but I thought I should add that the support has been there for years now. We've been waiting for browser vendors to catch up. Even prior to Opera's push for the video tag we had in-browser Java based playback of Ogg files on English Wikipedia. 3) How much time will legacy browsers be supported? For Wikimedia, legacy browser support is fairly inexpensive: legacy browsers play back the same files that the video/audio tag users get. So legacy support can last as long as it's relevant. For sites who have used other formats for legacy browsers, they have the cost of maintaining another set of encodes and format royalties, so for them there may be more incentive to drop legacy support. There is also a question of what constitutes 'legacy': There is one desktop browser that can play our video perfectly adequately using the HTML5 tags, but it requires a codec pack. 4) What prompted this desire to be an early adopter of this technology? Wikimedia has a long-standing commitment to open and unencumbered file formats which stems back to nearly the start of the projects. The mission of the Wikimedia Foundation is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally, and it has been the belief that people are more empowered when they don't feel forced for compatibility reasons to use formats they have to ask permission for and pay for. As such, the use of encumbered video technology such as Flash is not something that would be decided lightly. The adoption of the HTML5 tags follows naturally from this pre-existing behavior as a way of getting media working for a larger portion of the userbase. 5) Will other codecs except Theora be supported? The list of file types supported today can be found here: http://commons.wikimedia.org/wiki/Commons:File_types Really the support just depends on the intersection of the project requirements (as of today: free and unencumbered formats) and client support (as of today, Ogg/Theora has the widest client compatibility for HTML5 video). The thumbnailing infrastructure for video currently only handles Ogg/Theora but other formats could be easily added. This one isn't really a technical question. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] subst'ing #if parser functions loses line breaks, and other oddities
On Fri, Jun 26, 2009 at 12:01 PM, Gerard Meijssengerard.meijs...@gmail.com wrote: Hoi, At some stage Wikipedia was this thing that everybody can edit... I can not and will not edit this shit so what do you expect from the average Joe ?? I can not (effectively) contribute to http://en.wikipedia.org/wiki/Ten_Commandments_in_Roman_Catholicism Does this mean Wikipedia is a failure? I don't think so. Not everyone needs to be able to do everything. That's one reason projects have communities: Other people can do the work which I'm not interested in or not qualified for. Not everyone needs to make templates— and there are some people who'd have nothing else to do but add fart jokes to science articles if the site didn't have plenty of template mongering that needed doing. Unfortunately the existing system is needlessly exclusive. The existing parser function based solutions are so byzantine that even many people with the right interest and knowledge are significantly put off. The distinction between this and general ease of use is a very critical one. It's also the case that the existing system's problems spill past its borders due to its own limitations: Regular users need to deal with things like weird whitespace handling and templates which MUST be substed (or can't be substed; at random from the user's perspective). This makes the system harder even for the vast majority of people who should never need to worry about the internals of the templates. I think this is the most important issue, and it's one with real usability impacts, but it's not due to the poor syntax. On this point, the template language could be INTERCAL but still leave most users completely free to ignore the messy insides. The existing system doesn't, because there is no clear boundary between the page and the templates (among other reasons, like the limitations of the existing 'string' manipulation functions). ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Minify
On Fri, Jun 26, 2009 at 4:33 PM, Michael Dalemd...@wikimedia.org wrote: I would quickly add that the script-loader / new-upload branch also supports minify along with associating unique ids, grouping, and gzipping. So all your mediaWiki page includes are tied to their version numbers and can be cached forever without 304 requests by the client or _shift_ reload to get new js. Hm. Unique ids? Does this mean that every page on the site must be purged from the caches to cause all requests to see a new version number? Is there also some pending squid patch to let it jam in a new ID number on the fly for every request? Or have I misunderstood what this does? ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Unbreaking statistics
On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohderaro...@gmail.com wrote: On Fri, Jun 5, 2009 at 6:38 PM, Tim Starlingtstarl...@wikimedia.org wrote: Peter Gervai wrote: Is there a possibility to write a code which process raw squid data? Who do I have to bribe? :-/ Yes it's possible. You just need to write a script that accepts a log stream on stdin and builds the aggregate data from it. If you want access to IP addresses, it needs to run on our own servers with only anonymised data being passed on to the public. http://wikitech.wikimedia.org/view/Squid_logging http://wikitech.wikimedia.org/view/Squid_log_format How much of that is really considered private? IP addresses obviously, anything else? I'm wondering if a cheap and dirty solution (at least for the low traffic wikis) might be to write a script that simply scrubs the private information and makes the rest available for whatever applications people might want. There is a lot of private data in user agents (MSIE 4.123; WINNT 4.0; bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34 may be uniquely identifying). There is even private data in titles if you don't sanitize carefully (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box). There is private data in referrers (http://rarohde.com/url_that_only_rarohde_would_have_comefrom). Things which individually do not appear to disclose anything private can disclose private things (look at the people uniquely identified by AOL's 'anonymized' search data). On the flip side, aggregation can take private things (i.e. user agents; IP info; referrers) and convert them to non-private data: Top user agents; top referrers; highest traffic ASNs... but it becomes potentially revealing if not done carefully: The 'top' network and user agent info for a single obscure article in a short time window may be information from only one or two users, not really an aggregation. Things like common paths through the site should be safe so long as they are not provided with too much temporal resolution, limit themselves to existing articles, and either limit themselves to really common paths or break paths into two- or three-node chains and skip releasing the least common of those. Generally when dealing with private data you must approach it with the same attitude that a C coder must take to avoid buffer overflows. Treat all data as hostile, assume all actions are potentially dangerous. Try to figure out how to break it, and think deviously. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
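[A sketch of the "aggregate, and suppress small buckets" idea discussed above, assuming a simplified record shape. The field names and the threshold are illustrative only, not the actual squid log schema or policy.]

  // Turn raw request records into aggregate counts, dropping identifying
  // fields and withholding any bucket too small to be a real aggregation.
  function aggregateByUserAgentFamily(records, minCount) {
    var counts = {};
    records.forEach(function (r) {
      // Keep only a coarse UA family; discard IPs, titles, referrers entirely.
      var key = (r.userAgent || 'unknown').split(' ')[0];
      counts[key] = (counts[key] || 0) + 1;
    });
    var safe = {};
    for (var key in counts) {
      // Buckets below the threshold may describe only one or two users.
      if (counts[key] >= minCount) {
        safe[key] = counts[key];
      }
    }
    return safe;
  }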
Re: [Wikitech-l] Google web bugs in Mediawiki js from admins - technical workarounds?
On Thu, Jun 4, 2009 at 10:19 AM, David Gerard dger...@gmail.com wrote: Keeping well-meaning admins from putting Google web bugs in the JavaScript is a game of whack-a-mole. Are there any technical workarounds feasible? If not blocking the loading of external sites entirely (I understand hu:wp uses a web bug that isn't Google), perhaps at least listing the sites somewhere centrally viewable? Restrict site-wide JS and raw HTML injection to a smaller subset of users who have been specifically schooled in these issues. This approach is also compatible with other approaches. It has the advantage of being simple to implement and should produce a considerable reduction in problems regardless of the underlying cause. Just be glad no one has yet turned English Wikipedia's readers into their own personal DDoS drone network. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Google web bugs in Mediawiki js from admins - technical workarounds?
On Thu, Jun 4, 2009 at 10:53 AM, David Gerard dger...@gmail.com wrote: I understand the problem with stats before was that the stats server would melt under the load. Leon's old wikistats page sampled 1:1000. The current stats (on dammit.lt and served up nicely on http://stats.grok.se) are every hit, but I understand (Domas?) that it was quite a bit of work to get the firehose of data in such a form as not to melt the receiving server trying to process it. OK, then the problem becomes: how to set up something like stats.grok.se feasibly internally for all the other data gathered from a hit? (Modulo stuff that needs to be blanked per privacy policy.) What exactly are people looking for that isn't available from stats.grok.se and that isn't a privacy concern? I had assumed that people kept installing these bugs because they wanted per-article source network breakdowns and other clear privacy violations. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Google web bugs in Mediawiki js from admins - technical workarounds?
On Thu, Jun 4, 2009 at 11:01 AM, Mike.lifeguard mikelifegu...@fastmail.fm wrote: On Thu, 2009-06-04 at 15:34 +0100, David Gerard wrote: Then external site loading can be blocked. Why do we need to block loading from all external sites? If there are specific problematic ones (like google analytics) then why not block those? Because: (1) External loading results in an uncontrolled leak of private reader and editor information to third parties, in contravention of the privacy policy as well as basic ethical operating principles. (1a) Most external loading script usage will also defeat users' choice of SSL and leak more information about their browsing to their local network. It may also bypass any Wikipedia-specific anonymization proxies they are using to keep their reading habits private. (2) External loading produces a runtime dependency on third party sites. Some other site goes down and our users experience some kind of loss of service. (3) The availability of external loading makes Wikimedia a potential source of very significant DDoS attacks, intentional or otherwise. That's not to say that there aren't reasons to use remote loading, but the potential harms mean that it should probably be a default-deny, permit-by-exception process rather than the other way around. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] flagged revisions
On Wed, May 20, 2009 at 9:58 PM, Bart banati...@gmail.com wrote: I don't know about those flagged revisions. After a while, it would basically mean that every edit and page view would be doubled. For most [snip] Sorry to be curt, but why do people who have a weak understanding of the functionality available feel so compelled to make comments like this? The software supports automatically preserving the standing flagging (or some portion of it) when users with the authority to set those flags make edits. This eliminates the inherent doubling. The flagging communicates to users that a revision has been reviewed to some degree by an established user. This should allow review resources to be applied more effectively, rather than having 100 people review every change to a popular article while changes to less popular articles end up insufficiently reviewed. Furthermore, the existence of flagged versions in the history means that when a series of unflagged revisions are made they can be reviewed in a single action by viewing the diff against the single most recent 'known-probably-good' flagged revision. Without these points in the history every single edit must be individually reviewed. The exact change in workload isn't clear: If there is an increase in workload then it would come from performing review of changes by less-established users (those unable to set the flags) which previously went completely without review. I hope that there isn't currently enough completely unreviewed material that it would offset the time saving improvements of collaborative review and known-good comparison points. I'm sure that it is possible to find worthwhile criticisms of the flagging functionality (or the particular configuration requested by EnWP), but many people have worked very hard on this functionality and many of the most obvious possible problems have been addressed. To produce an effective criticism you're going to need to spend a decent amount of time researching, reading discussion history, trying the software, etc. Maybe if you do you'll find that the functionality isn't as frightening as you feared, and hopefully you'll find a new possible problem which can actually be addressed without rejecting this attempt at forward progress. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Dealing with Large Files when attempting a wikipedia database download - Focus upon Bittorrent List
On Fri, Apr 17, 2009 at 9:42 PM, Gregory Maxwell gmaxw...@gmail.com wrote: [snip] But if you are running parallel connections to avoid slowdowns you're just attempting to cheat TCP congestion control and get an unfair share of the available bandwidth. That kind of selfish behaviour fuels non-neutral behaviour and ought not be encouraged. [snip] On Sat, Apr 18, 2009 at 3:06 AM, Brian brian.min...@colorado.edu wrote: I have no problem helping someone get a faster download speed and I'm also not willing to fling around fallacies about how selfish behavior is bad for society. Here is wget vs. aget for the full history dump of the simple [snip] And? I did point out this is possible, and that no torrent was required to achieve this end. Thank you for validating my point. Since you've called my position fallacious I figure I ought to give it a reasonable defence, although we've gone off-topic. The use of parallel TCP has allowed you an inequitable share of the available network capacity[1]. The parallel transport is fundamentally less efficient as it increases the total number of congestion drops[2]. The categorical imperative would have us not perform activities that would be harmful if everyone undertook them. At the limit: If everyone attempted to achieve an unequal share of capacity by running parallel connections the internet would suffer congestion collapse[3]. Less philosophically and more practically: the unfair usage of capacity by parallel-fetching P2P tools is a primary reason for internet providers to engage in 'non-neutral' activities such as blocking or throttling this P2P traffic[4][5][6]. Ironically, a provider which treats parallel transport technologies unfairly will be providing a more fair network service, and non-neutral handling of traffic is the only way to prevent an (arguably unfair) redistribution of transport costs towards end-user-heavy service providers. (I highly recommend reading the material in [5] for a simple overview of P2P fairness and network efficiency, as well as the Briscoe IETF draft in [4] for a detailed operational perspective.) Much of the public discussion on neutrality has focused on portraying service providers considering or engaging in non-neutral activities as greedy and evil. The real story is far more complicated and far less clear cut. Where this is on-topic is that non-neutral behaviour by service providers may well make the Wikimedia Foundation's mission more costly to practice in the future. In my professional opinion the best defence against this sort of outcome available to organizations like Wikimedia (and other large content houses) is the promotion of equitable transfer mechanisms which avoid unduly burdening end user providers, and which therefore avoid providing an objective justification for non-neutral behaviour. To this end Wikimedia should not promote or utilize cost shifting technology (such as P2P distribution) or inherently unfair or inefficient transmission (parallel TCP, or a fudged server-side initial window) gratuitously. I spent a fair amount of time producing what I believe to be a well cited reply which I believe stands well enough on its own that I should not need to post any more in support of it. I hope that you will at least put some thought into the issues I've raised here before dismissing this position. If my position is fallacious then numerous academics and professionals in the industry are guilty of falling for the same fallacies. [1] Cho, S. 
2006. Congestion Control Schemes for Single and Parallel TCP Flows in High Bandwidth-Delay Product Networks. Doctoral thesis, UMI Order Number: AAI3219144, Texas A&M University. [2] Padhye, J., Firoiu, V., Towsley, D., and Kurose, J., Modeling TCP throughput: a simple model and its empirical validation. ACM SIGCOMM, Sept. 1998. [3] Floyd, S., and Fall, K., Promoting the Use of End-to-End Congestion Control in the Internet, IEEE/ACM Transactions on Networking, Aug. 1999. [4] B. Briscoe, T. Moncaster, L. Burness (BT), http://tools.ietf.org/html/draft-briscoe-tsvwg-relax-fairness-01 [5] Nicholas Weaver, presentation: Bulk Data P2P: Cost Shifting, not Cost Savings (http://www.icsi.berkeley.edu/~nweaver/p2pi_shifting.ppt); Nicholas Weaver, position paper, P2PI Workshop, http://www.funchords.com/p2pi/1 p2pi-weaver.txt [6] Bruno Tuffin, Patrick Maillé: How Many Parallel TCP Sessions to Open: A Pricing Perspective. ICQT 2006: 2-12 ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Dealing with Large Files when attempting a wikipedia database download - Focus upon Bittorrent List
On Fri, Apr 17, 2009 at 6:10 PM, Chad innocentkil...@gmail.com wrote: I seem to remember there being a discussion about the torrenting issue before. In short: there's never been any official torrents, and the unofficial ones never got really popular. Torrent isn't a very good transfer method for things which are not fairly popular, as it has a fair amount of overhead. The Wikimedia download site should be able to saturate your internet connection in any case… ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Dealing with Large Files when attempting a wikipedia database download - Focus upon Bittorrent List
On Fri, Apr 17, 2009 at 9:21 PM, Stig Meireles Johansen sti...@gmail.com wrote: But some ISP's throttle TCP-connections (either by design or by simple oversubscription and random packet drops), so many small connections *can* yield a better result for the end user. And if you are so unlucky as to having a crappy connection from your country to the download-site, maybe, just maybe someone in your own country already has downloaded it and is willing to share the torrent... :) I can saturate my little 1M ADSL-link with torrent-downloads, but forget about getting throughput when it comes to HTTP-requests... if it's in the country, in close proximity and the server is willing, then *maybe*.. but else.. no way. There are plenty of downloading tools that will use range requests to download a single file with parallel connections… But if you are running parallel connections to avoid slowdowns you're just attempting to cheat TCP congestion control and get an unfair share of the available bandwidth. That kind of selfish behaviour fuels non-neutral behaviour and ought not be encouraged. We offered torrents in the past for the Commons Picture of the Year results— a more popular thing to download, a much smaller file (~500 MB vs many GB), and not something which should become outdated every month… and pretty much no one stayed connected long enough for anyone else to manage to pull anything from them. It was an interesting experiment, but it indicated that further use for these sorts of files would be a waste of time. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Large nested templates (example: NYRepresentatives)
On Tue, Apr 14, 2009 at 10:52 AM, Sergey Chernyshev sergey.chernys...@gmail.com wrote: Domas, In this particular case, template will just contain an SMW query to get all representatives. [snip] How does this avoid merely shifting the load from the parser (on the plentiful application servers) to the database? Not that more intelligence isn't good— From a content-maintenance perspective something query based is doubtlessly better than some static serialized lump, but the complaint here was performance as far as I can tell, and I think that's a more complicated question. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] ANNOUNCE: OpenStreetMap maps will be added to Wikimedia projects
On Sun, Apr 5, 2009 at 10:12 PM, Brian brian.min...@colorado.edu wrote: Great. Let us know when you've got community approval. Better than a simple super-majority too, per the precedent set in the recent discussions related to revision flagging. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Providing simpler dump format (raw, SQL or CSV)?
On Tue, Mar 31, 2009 at 10:02 AM, Christensen, Courtney christens...@battelle.org wrote: -Original Message- Given that the current dump process is having problem, why not provide a simple fix such as providing raw table format , SQL files or even CSV files? Howard, Can't you get the SQL files from running mysqldump from the command line? Why does something new need to be created? I hope I'm not being dense, but I don't understand what new niche you are asking to fill. Because the data (text) isn't in a single database, even for a single project; it is spread across a large number of machines. It's also in a mixture of bizarre internal formats. The file format is pretty much irrelevant to the 'cost' of producing a dump. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] PDF vulnerability
On Fri, Feb 20, 2009 at 12:57 PM, Platonides platoni...@gmail.com wrote: [snip] It could also pass a virus scan but I don't think it's really needed. Virus scanners mainly look for known bad code, inside executables. We don't want any kind of executable. I've run clamav against the entire set of files in the past. Found a couple of interesting things (like, 3 files out of millions). Converting PDF to PostScript and back will probably totally kill the text layer. Might as well render to images and DjVu. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Javascript localization, minify, gzip cache forever
On Fri, Feb 20, 2009 at 5:51 PM, Brion Vibber br...@wikimedia.org wrote: [snip] On the other hand we don't want to delay those interactions; it's probably cheaper to load 15 messages in one chunk after showing the wizard rather than waiting until each tab click to load them 5 at a time. But that can be up to the individual component how to arrange its loads... Right. It's important to keep in mind that in most cases the user is *latency bound*. That is to say that the RTT between them and the datacenter is the primary determining factor in the load time, not how much data is sent. Latency determines the connection time; it also influences how quickly rwin can grow and get you out of slow-start. When you send more at once you'll also be sending more of it with a larger rwin. So in terms of user experience you'll usually improve results by sending more data if doing so is able to save you a second request. Even ignoring the user's experience— connections aren't free. There is byte-overhead in establishing a connection. Byte-overhead in lost compression by working with smaller objects. Byte-overhead in having more partially filled IP packets. CPU overhead from processing more connections, etc. Obviously there is a line to be drawn— You wouldn't improve performance by sending the whole of Wikipedia on the first request. But you will most likely not be conserving *anything* by avoiding sending another kilobyte of compressed user interface text for an application a user has already invoked, even if only a few percent use the additional messages. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
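[A sketch of the "load the wizard's messages in one chunk" idea in plain JavaScript: one request when the wizard opens, instead of a small request per tab click. The loader URL and message keys are made up for illustration, not an actual MediaWiki endpoint.]

  // Fetch all interface messages the wizard will need in a single round trip.
  function loadWizardMessages(callback) {
    var keys = ['upload-step1', 'upload-step2', 'upload-step3', 'upload-errors'];
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/w/load-messages?keys=' + encodeURIComponent(keys.join('|')), true);
    xhr.onreadystatechange = function () {
      if (xhr.readyState === 4 && xhr.status === 200) {
        // The extra kilobyte of compressed text costs far less than the
        // latency of a second request on each tab click.
        callback(JSON.parse(xhr.responseText));
      }
    };
    xhr.send(null);
  }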
Re: [Wikitech-l] inconsistent precision in PHP output
On Wed, Feb 11, 2009 at 12:29 PM, Robert Rohde raro...@gmail.com wrote: Yes Domas, haha, because no one would ever want to write about math or high precision scientific measurements in an encyclopedia. Holy crud! You don't use floating point for this! If you need deterministic behaviour and high accuracy you need to confine yourself to integer mathematics. Sure, *write about* high precision scientific measurements in Wikipedia, but don't use Wikipedia to *make them*. [snip] Am I wrong in thinking that the server admins should care when different machines produce different output from the same code? In this case, the behavior suggests it may be as simple as ensuring that the servers have the same php.ini precision settings. Is there any reason to think that this is related to a PHP setting rather than being a result of differences in compiler decisions with respect to moving variables in and out of the x87 stack and into memory, or the use of SSE? Or some libc difference in how the FPU rounding mode is set? At 12 digits you are beyond the expected precision of single precision floating point, and not far from what you get with doubles. On x86 the delivered precision can vary wildly depending on the precise sequence of calculations and register spills. For code compiled without -ffast-math the former should be stable for a single piece of code, but the latter is anyone's guess. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
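[A small illustration, in JavaScript (whose doubles behave like PHP's), of how the number of printed digits decides whether two slightly different results look "the same". This shows the class of effect being discussed, not the actual PHP output in question.]

  // The same double printed at different precisions: at 12 significant
  // digits the representation error is invisible, at 17 it shows up.
  var x = 0.1 + 0.2;               // not exactly 0.3 in binary floating point
  console.log(x.toPrecision(12));  // "0.300000000000"
  console.log(x.toPrecision(17));  // "0.30000000000000004"
  // A computation whose low-order bits depend on the code path taken
  // (x87 vs SSE, different register spills) can likewise disagree only
  // in digits that a 12-digit print never shows.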
Re: [Wikitech-l] – Fixing {val}
On Sat, Jan 31, 2009 at 8:33 PM, Robert Rohde raro...@gmail.com wrote: This discussion is getting side tracked. The real complaint here is that {{#expr:(0.7 * 1000 * 1000) mod 1000}} is giving 69 when it should give 70. This is NOT a formatting issue, but rather it is bug in the #expr parser function, presumably caused by some kind of round-off error. It's a bug in the user's understanding of floating point on computers, combined with mod being (quite naturally) an operator on integers. 0.7… does not exist in your finite precision base-2 computer. I don't think it's reasonable for MediaWiki to include a full radix-n multi-precision floating point library in order to capture the behavior you expect for these cases, any more than it would be reasonable to expect it to contain a full computer algebra system so it could handle manipulations of irrationals precisely. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
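[A small JavaScript illustration of the class of error at issue: decimal fractions generally have no exact binary representation, and truncating to an integer before a mod turns that tiny representation error into a whole-unit discrepancy. The specific values are illustrative, not a reproduction of the exact #expr case quoted above.]

  // Most decimal fractions are stored inexactly:
  console.log(0.1 + 0.2 === 0.3);   // false: the sum is 0.30000000000000004
  // A result that is mathematically an exact integer, but is computed a
  // hair low in floating point, loses a whole unit once it is truncated
  // before the mod:
  var nearlyFive = 4.999999999999999;       // stands in for a value "meant" to be 5
  console.log(Math.floor(nearlyFive) % 5);  // 4, where exact arithmetic on 5 gives 0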