Re: [Wikitech-l] Jake requests enabling access and edit access to Wikipedia via TOR

2014-01-12 Thread Gregory Maxwell
On Tue, Dec 31, 2013 at 1:08 AM, Martijn Hoekstra
martijnhoeks...@gmail.com wrote:
 Does Jake have any mechanism in mind to prevent abuse? Is there any
 possible mechanism available to prevent abuse?

Preventing abuse is the wrong goal. There is plenty of abuse even
with all the privacy-smashing, new-editor-deterring convolutions that
we can think up.  Abuse is part of the cost of doing business when
operating a publicly editable wiki; it's a cost which is normally well
worth its benefits.

The goal merely needs to be to limit the abuse enough so as not to
upset the abuse-versus-benefit equation. Today, people abuse, they get
blocked, they go to another library/coffee shop/find another
proxy/wash, rinse, repeat.  We can't do any better than that model, and
it turns out that it's okay.  If a solution for Tor users results in a
cost (time, money, whatever unit of payment is being expended)
for repeated abuse comparable to the other ways abusive people access
the site, then it should not be a major source of trouble which
outweighs the benefits. (Even if you do not value freedom of
expression and association for people in less free parts of the world
at all.)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Jake requests enabling access and edit access to Wikipedia via TOR

2014-01-12 Thread Gregory Maxwell
On Sun, Jan 12, 2014 at 6:36 PM, Jasper Deng jas...@jasperswebsite.com wrote:
 This question is analogous to the question of open proxies. The answer has
 universally been that the costs (abuse) are just too high.

No, it's not analogous to simply permitting open proxies, as no one in
this thread is suggesting just flipping it on.

I proposed issuing blind exemption tokens up-thread as an example
mechanism which would preserve the rate-limiting of abusive use
without removing privacy.

 However, we might consider doing what the freenode IRC network does.
 Freenode requires SASL authentication to connect on Tor, which basically
 means only users with registered accounts can use it. The main reason for
 hardblocking and not allowing registered accounts on-wiki via Tor is that
 CheckUsers need useful IP data. But it might be feasible if we just force
 all account creation to happen on real IPs, although that still hides
 some data from CheckUsers.

What freenode does is not functionally useful for Tor users. In my
first-hand experience it manages to enable abusive activity while
simultaneously eliminating Tor's usefulness at protecting its users.

The only value it provides is a pretext of Tor support without
actually doing something good... and we already have the "you can get
an IPblock-exempt" option (except you can't really, and if you do
it'll get randomly revoked), if all we want is a pretext. :)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Jake requests enabling access and edit access to Wikipedia via TOR

2013-12-30 Thread Gregory Maxwell
On Mon, Dec 30, 2013 at 6:10 PM, Tyler Romeo tylerro...@gmail.com wrote:
 On Mon, Dec 30, 2013 at 7:34 PM, Chris Steipp cste...@wikimedia.org wrote:

 I was talking with Tom Lowenthal, who is a tor developer. He was trying to
 convince Tilman and I that IP's were just a form of collateral that we
 implicitly hold for anonymous editors. If they edit badly, we take away the
 right of that IP to edit, so they have to expend some effort to get a new
 one. Tor makes that impossible for us, so one of his ideas is that we shift
 to some other form of collateral-- an email address, mobile phone number,
 etc. Tilman wasn't convinced, but I think I'm mostly there.


 This is a viable idea. Email addresses are a viable option considering they
 take just as much (if not a little bit more) effort to change over as IP
 addresses. We can take it even a step further and only allow email
 addresses from specific domains, i.e., we can restrict providers of
 so-called throwaway emails.

Email is pretty shallow collateral, especially if you actually allow
email providers which are materially useful to people who are trying
to protect their privacy.  Allowing only email providers which require
SMS binding, for example, would be pretty terrible... This is doubly
so because the relationship is discoverable: you only really wanted
the email to provide scarcity, but because it was provided it could be
used to deanonymize the users. (Even if you intentionally didn't log
the email-user mapping, it would end up being deanonymized-by-time in
database backups, or could be secretly logged at any time, e.g. via
compromised staff.)

FAR better than this can be done without much more work.

Digging up an old proposal of mine…

A proposal for more equitable access to ipblock-exempt.

In the "Jake requests enabling access and edit access to Wikipedia via TOR"
thread on wikitech-l
[http://lists.wikimedia.org/pipermail/wikitech-l/2013-December/073764.html]
the issue of being able to edit Wikipedia via Tor was highlighted.

Some people appear to have mistaken this thread as being specifically about
Jake. This isn't so— Jake is technologically sophisticated and has access to
many technical and social resources. Jake-the-person can edit Wikipedia,
with suitable effort. But Jake-as-a-proxy-for-other-Tor-users has a much
harder time.

Ipblock-exempt as implemented today doesn't— as demonstrated
[http://lists.wikimedia.org/pipermail/wikitech-l/2013-December/073773.html]
—even work for
Jake. It certainly doesn't work for more typical users.

Many people believe that Wikipedia has become so socially important that being
able to edit it— even if just to leave talk page comments— is an essential
part of participating in worldwide society. Unfortunately, not all people
are equally free and some can only access Wikipedia via anti-censorship
technology or can only speak without fear of retaliation via anonymity
technology.

Wikipedia must balance the interests of preventing abuse and enabling
the sharing of knowledge. Only so much can be accomplished by prohibiting
access to Tor entirely: miscreants can and do use paid VPNs and compromised
hosts to evade blocks on a constant basis. Ironically, abusive users who
are unconcerned about breaking the law have an easier time editing Wikipedia
than people simply concerned with unlawful surveillance. That isn't a
balance.

In order to better balance these interests, I propose the following
technical improvement:

A new special page should be added with a form which takes an unblocked
username and accepts a base64-encoded message containing a random serial
number and an RSA digital signature made with a well-known,
Wikimedia-controlled private key; we'll call this message an exemption
token. If the signature verifies and the serial number has never been
seen before, the serial number is saved and Ipblock-exempt is set on
the account.

Additionally, the online donation process is updated with some client-side
JS so that for every $10 donated the client picks a random value,
cryptographically blinds it
[https://en.wikipedia.org/wiki/Blind_signature#Blind_RSA_signatures.5B2.5D:235],
and submits the blinded value along with the donation. When the donation is
successful, the donation server signs the blinded values and returns them,
and the client unblinds them and presents the resulting messages to the
user.

[RSA blinding is no more complicated to implement than RSA signing in
general. It requires a modular exponentiation, a multiplication, and a
modular inversion.]

The donor is free to save the messages, give them out to friends, or press
some button to give them to the Tor Project. Each message entitles one
account to be exempted, and Wikimedia is unable to associate donations with
accounts due to the blinding.
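
To make the mechanics concrete, here is a minimal sketch of the token
lifecycle in Python, using textbook RSA with the classic toy parameters
(n = 3233) and no padding. It is illustration only; the key size is
insecure, and names like redeem() and the serial/token layout are
assumptions of mine, not an existing Wikimedia or MediaWiki interface.

import hashlib
import secrets
from math import gcd

# Wikimedia-controlled toy RSA key (public: n, e; private: d).
n, e, d = 3233, 17, 2753

def h(serial: bytes) -> int:
    # Hash the random serial down to an integer modulo n.
    return int.from_bytes(hashlib.sha256(serial).digest(), "big") % n

# --- donor's browser ---
serial = secrets.token_bytes(16)             # the random serial number
while True:
    r = secrets.randbelow(n - 2) + 2         # blinding factor, coprime to n
    if gcd(r, n) == 1:
        break
blinded = (h(serial) * pow(r, e, n)) % n     # submitted with the donation

# --- donation server (never sees `serial`) ---
blind_sig = pow(blinded, d, n)               # signs the blinded value

# --- donor's browser: unblind to obtain the exemption token ---
sig = (blind_sig * pow(r, -1, n)) % n
token = (serial, sig)

# --- special page: redeem the token ---
used_serials = set()

def redeem(serial: bytes, sig: int) -> bool:
    if serial in used_serials:               # each serial can be spent once
        return False
    if pow(sig, e, n) != h(serial):          # verify the RSA signature
        return False
    used_serials.add(serial)                 # ...then set Ipblock-exempt
    return True

assert redeem(*token)                        # first redemption succeeds
assert not redeem(*token)                    # replays are refused

Because the signer only ever sees the blinded value, it cannot later link
a redeemed serial back to a particular donation, which is what keeps
donations and accounts unassociable.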

Finally, the block notice should direct people to a page with instructions
on obtaining exemption tokens.

This process would provide a guaranteed bound on the amount of abusive
use of ipblock-exempt. 

Re: [Wikitech-l] Live stream from Wikimania 2010 about MediaWiki

2010-07-11 Thread Gregory Maxwell
On Sun, Jul 11, 2010 at 5:42 AM, Siebrand Mazeland s.mazel...@xs4all.nl wrote:
 Hi,
 Just to inform you about the NOW running live streams from Wikimania about 
 MediaWiki.
 See http://toolserver.org/~reedy/wikimania2010/jazzhall.html
 Runs until 13.00 CEST TODAY/NOW!

Shame. This requires some plugin stuff.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Live stream from Wikimania 2010 about MediaWiki

2010-07-11 Thread Gregory Maxwell
On Sun, Jul 11, 2010 at 6:29 AM, Erik Moeller e...@wikimedia.org wrote:
 I am hugely grateful that we have reliable streaming this year, thanks
 to a lot of volunteer effort. Perhaps we can defer the ideological
 nitpicking and just share that appreciation. I would be grateful even
 if it required a Windows-only plugin, which Flash is not.

I've been working from a non-x86 system the past couple of days. Even
if I wanted to install the proprietary flash software I couldn't.

The time delayed uploaded files worked pretty well last year, and I
was able to watch all the presentations I was interested in.  This
year I wasn't able to watch a single one.

This isn't merely ideology. But even if it were, ideology doesn't mean
"without practical value"; ideology can often mean preferring a
strategy believed to be practically superior over the long term in
preference to some short-term expedience. I presume you pursue
long-term winning strategies over the best immediate gain constantly
throughout your life and don't consider these decisions to be
ideological, much less nitpicking.

I suppose it is valuable information to know that you have so little
respect for my opinions, though I would have preferred to learn of
this someplace other than on a public mailing list.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Reject button for Pending Changes

2010-06-27 Thread Gregory Maxwell
On Sun, Jun 27, 2010 at 2:48 PM, Rob Lanphier ro...@wikimedia.org wrote:
[snip]
 look at the revision history.  However, this should be reasonably rare, and
 the diff remains in the edit history to be rescued, and can be reapplied if
 need be.  A competing problem is that disabling the reject button will

Do you have any data to support your rarity claim beyond the fact
that reviews spanning multiple revisions are themselves rare to the
point of non-existence on enwp currently?

Why is rarity a good criterion for increasing the incidence of blind
reversion of good edits?   An informal argument here is that many
contributors will tell you that if their initial honest contributions
to Wikipedia had been instantly reverted they would not have continued
editing— and so extreme caution should be taken in encouraging blind
reversion unless it is urgently necessary.

Current review delays on enwp are very short; what is the urgency for
requiring a mechanism for _faster_ reversions of edits which are not
being displayed to the general public?

Could the goal of reducing confusion around the unapprove button be
equally well met by removing the unapprove button from the review
screen, where it is confusingly juxtaposed with the approve button, and
instead displaying it on the edit history next to the text indicating
which revisions have the reviewed state?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Broken validation statistics

2010-06-27 Thread Gregory Maxwell
Is anyone working on fixing the broken output from
http://en.wikipedia.org/wiki/Special:ValidationStatistics ?

I brought this up on IRC a week-ish ago and there was some speculation
as to the cause but it wasn't clear to me if anyone was working on
fixing it.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Reject button for Pending Changes

2010-06-27 Thread Gregory Maxwell
On Sun, Jun 27, 2010 at 6:04 PM, Rob Lanphier ro...@robla.net wrote:
 On Sun, Jun 27, 2010 at 12:12 PM, Gregory Maxwell gmaxw...@gmail.comwrote:

 On Sun, Jun 27, 2010 at 2:48 PM, Rob Lanphier ro...@wikimedia.org wrote:
 [snip]
  look at the revision history.  However, this should be reasonably rare,
 and
  the diff remains in the edit history to be rescued, and can be reapplied
 if
  need be.  A competing problem is that disabling the reject button will

 Do you have a any data to support your rarity claim beyond the fact
 that reviews spanning multiple revisions are themselves rare to the
 point of non-existence on enwp currently?



 I don't have that data.  However, let me put it another way.  We have a
 known problem (many people confused/frustrated by the lack of an enabled
 reject button), which we're weighing against a theoretical and currently
 unquantified problem (the possibility that an intermediate pending revision
 should be accepted before a later pending revision is rejected).  I don't
 think it's smart for us to needlessly disable this button in the absence of
 evidence showing that it should be disabled.

I think you've failed to actually demonstrate a known problem here.

The juxtaposition of the approve and unapprove buttons can be
confusing, I agree. In most of the discussions where it has come up,
people appear to have left satisfied once it was explained to them that
'rejecting' wasn't a tool limited to reviewers— that everyone can do it
using the same tools that they've always used.

Or, in other words, shortcomings in the current interface design have
made it difficult for someone to figure out what actions are available
to them— not that they actually have any need for more potent tools to
remove contributions from the site.

I think it's important to note that reverting revisions is a regular
editorial task that we've always had, and one which pending changes has
almost no interaction with.  If there is a need for a one-click,
multi-contributor, multi-contribution bulk revert, why has it not
previously been implemented?

Moreover, you've selectively linked one of several discussions — when
in others it was made quite clear that many people (myself included,
of course) consider a super-rollback "undo everything pending" button
to be highly undesirable.

Again— I must ask: where is the evidence that we need tools to
increase the _speed_ of reversion actions on pages with pending
changes at the expense of the quality of those determinations?   Feel
free to point out if you don't actually believe a bulk revert button
would be such a trade-off.


 The current spec doesn't call for blind reversion.  It has a confirmation
 screen that lists the revisions being reverted.

I don't think it's meaningful to say that a revert wasn't blind simply
because the reverting user was exposed to a list of user names, edit
summaries, and timestamps (particularly without immediate access to
the diffs).

A blind revert is a revert which is made without evaluating the
content of the change.   Such reverts are possible through the
rollback button, for example, but rollback is limited to the
contiguous edits by a single contributor.  Blind reverts can also be
done by selecting an old version and saving it, but that takes several
steps and the software cautions you about doing it.

The removal of rollback privileges due to excessively sloppy use is a
somewhat frequent event, and the proposed change to the software is
even more risky.

These bulk tools also remove the ability to provide an individual
explanation for the removal of each of the independent changes.


 I think making accept/unaccept into a single toggling button is the
 right thing to do.

Because of page load times, by the time I get the review screen up
someone has often already approved the revision. If I am not maximally
attentive, will I now accidentally unapprove a fine version of the page
simply because the button I normally click has reversed its meaning?

This doesn't seem especially friendly to me. Or: a user interface is
well designed when the program behaves exactly how the user thought it
would, and this won't.

 Furthermore, because of the potentially confusing result
 of unaccepting something, I'd even recommend only making it possible when
 looking at the diff between the penultimate accepted revision and the latest
 accepted revision, which is documented in this request:
 http://www.pivotaltracker.com/story/show/3949176

That sounds good to me.  Though the review screen which you'd visit
with the intent of reviewing a change fits that description, and if you
change the meaning of a commonly used button it will result in errors
of the form I just raised.


 However, I don't think that removes the need for a reject button, for
 reasons I outline here:
 http://flaggedrevs.labs.wikimedia.org/wiki/Wikimedia_talk:Reject_Pending_Revision


At the DC meetup yesterday someone used the explanation  Pending
changes is an approval of a particular

Re: [Wikitech-l] Reject button for Pending Changes

2010-06-27 Thread Gregory Maxwell
On Sun, Jun 27, 2010 at 9:59 PM, Gregory Maxwell gmaxw...@gmail.com wrote:
 Moreover, you've selectively linked one of several discussions — when
 in others it was made quite clear that many people (myself included,
 of course) consider a super-rollback  undo everything pending button
 to be highly undesirable.

Someone asked me off list to provide an example, so here is one:

http://en.wikipedia.org/wiki/Wikipedia_talk:Reviewing#What_gets_flagged_and_what_does_not

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Problem with the pending changes review screen.

2010-06-15 Thread Gregory Maxwell
Imagine an article with many revisions and pending changes enabled:
A, B, C, D, E, F, G...

A is an approved edit. B,C,D,E,F,G are all pending edits.

B is horrible vandalism that the subsequent edits did not fix.

You are a reviewer, and you go to the review page by clicking a pending
review link.  On the review page you can accept— thus putting the
horrible vandalism on the site. Or you can reject, which throws out
all the good edits of C, D, E, F, G by reverting the article to A.

To quote someone from IRC: "this seems like it's going to make vandals
even more effective, because all they have to do is make one edit in a
string of ten good ones, and then the entire set has to be thrown out."

But that isn't true at all.   You're not confined to the review page;
you simply go to the edit history, click undo on B, and then approve
your own edit (it won't be auto-approved because G wasn't approved).
Tada.

This is completely non-obvious to people, because the only options on
the review page are accept or reject, and it's already causing
confusion.  This is a direct result of the late-in-the-process addition
of the review button— trying to fit the round peg of a
revision-reviewing system (which we can't have, because of the
fundamental incompatibility with a single linear editing history) into
the square hole of the presentation-flagging system that we actually
have.

I don't know how to fix this. We could remove the reject button to
make it clearer that you use the normal editing functions (with
their full power) to reject.  But I must admit that the easy rollback
button is handy there.   Alternatively we could put a small chunk of
the edit history on the review page, showing the individual edits
which comprise the span-diff (bonus points for color-coding if someone
wants to make a real programming project out of it) along with the
undo links and such.

In the meantime I expect enwp will edit the message text to direct
people to the history page for more sophisticated editing activities.


(Thanks to Risker for pointing out how surprising the pending review
page was for this activity)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Problem with the pending changes review screen.

2010-06-15 Thread Gregory Maxwell
On Tue, Jun 15, 2010 at 11:05 PM, Gregory Maxwell gmaxw...@gmail.com wrote:
 Imagine an article with many revisions and pending changes enabled:
 A, B, C, D, E, F, G...
[snip]
 I don't know how to fix this. We could remove the reject button to
 make it more clear that you use the normal editing functions (with
 their full power) to reject.  But I must admit that the easy rollback
 button is handy there.   Alternatively we could put a small chunk of
 the edit history on the review page, showing the individual edits
 which comprise the span-diff (bonus points for color-coding if someone
 wants to make a real programming project out of it) along with the
 undo links and such.
[snip]


Further discussion with Risker has caused me to realize that there is
another significant problem situation with the reject button.

Consider the following edit sequence:

A, B, C, D, E


A is a previously approved version.  B and D are excellent edits.
C and E are obvious vandalism.  E even managed to undo all the good
changes of B and D while adding the vandalism.

A reviewer hits the pending revisions link in order to review; they
get the span diff from A to E.  All they see is vandalism, with no
indication of the redeeming edits in the intervening span.  So they
hit reject.  The good edits are lost.


Unlike the prior problem, the only way to solve this would be to only
display the REJECT button if all of the pending changes are by the
same author (or to limit it to only one pending change in the span,
which would be slightly more conservative, but considering the
behaviour of the rollback button I think the group-by-author behaviour
would be fine).   The accept button is still safe.
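
For clarity, a tiny sketch of that eligibility rule, assuming each
pending revision is represented as a dict with a "user" field (my own
toy representation, not FlaggedRevs' actual data model):

def show_reject_button(pending_revisions):
    # Safe only when every pending edit in the span comes from a single
    # author, mirroring the scope of the existing rollback button.
    authors = {rev["user"] for rev in pending_revisions}
    return len(authors) == 1

# A span like B..E above can involve multiple authors, so the button
# would stay hidden:
print(show_reject_button([{"user": "GoodFaithEditor"}, {"user": "Vandal"}]))  # False
print(show_reject_button([{"user": "Vandal"}, {"user": "Vandal"}]))           # True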

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Problem with the pending changes review screen.

2010-06-15 Thread Gregory Maxwell
On Tue, Jun 15, 2010 at 11:38 PM, Carl (CBM) cbm.wikipe...@gmail.com wrote:
 On Tue, Jun 15, 2010 at 11:30 PM, Gregory Maxwell gmaxw...@gmail.com wrote:
 Consider the following edit sequence:

 A, B, C, D, E

 A is a previously approved version.  B, and D are all excellent edits.
  C and E are obvious vandalism.  E even managed to undo all the good
 changes of B,D while adding the vandalism.

 The only way to handle this sort of thing is to actually look at the
 intermediate edits. I don't know if there is a nice way to simplify
 that workflow, but it points me towards the idea that reviewing should
 be done off the history page, not directly off a list of unreviewed
 pages.

This is how the software worked until recently. :(

I feel foolish for not catching this until now even though I was aware
of the addition of the reject button. Sorry.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Updating strings for FlaggedRevs for the Flagged Protection/Pending Revisions/Double Check launch

2010-05-22 Thread Gregory Maxwell
On Sat, May 22, 2010 at 2:13 PM, Rob Lanphier ro...@wikimedia.org wrote:
 Hi everyone,

 I'm preparing a patch against FlaggedRevs which includes changes that Howie
 and I worked on in preparation for the launch of its deployment onto
 en.wikipedia.org .  We started first by creating a style guide describing
 how the names should be presented in the UI:
 http://en.wikipedia.org/wiki/Wikipedia:Flagged_protection_and_patrolled_revisions/Terminology
[snip]

I'm concerned that the simplified graphical explanation of the process
fosters the kind of misunderstanding that we saw in the first Slashdot
threads about flagged revisions... particularly the mistaken belief
that the process is synchronous.

People outside of the active editing community have frequently raised
the same concerns on their exposure to the idea of flagged revisions.
Common ones I've seen: "Won't people simply reject changes so they can
make their own edits?"  "Who is going to bother to merge all the
unreviewed changes on a busy article? They're going to lose a lot of
contributions!"

None of these concerns really apply to the actual implementation
because it's the default display of the articles which is controlled,
not the ability to edit. There is still a single chain of history and
the decision to display an article happens totally asynchronously with
the editing.

The illustration still fosters the notion of some overseeing
gatekeeper on an article expressing editorial control— which is not
the expected behaviour of the system, nor a desired behaviour,  nor
something we would even have the resources to do if it were desirable.
 In particular there is no per-revision analysis mandated by our
system:  Many edits will happen, then someone with the right
permissions will look at a delta from then-to-now and decide that
nothing is terrible in the current version and make it the displayed
version.   It's possible that there were terrible intermediate
versions, but it's not relevant.

I have created a poster suitable for distribution to journalists
http://myrandomnode.dyndns.org:8080/~gmaxwell/flagged_protection.png

(Though the lack of clarity in the ultimate naming has made it very
difficult to finalize it.  If anyone wants it I can share SVG/PDF
versions of it).

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Updating strings for FlaggedRevs for the Flagged Protection/Pending Revisions/Double Check launch

2010-05-22 Thread Gregory Maxwell
On Sat, May 22, 2010 at 5:09 PM, Gregory Maxwell gmaxw...@gmail.com wrote:
 I have created a poster suitable for distribution to journalists
 http://myrandomnode.dyndns.org:8080/~gmaxwell/flagged_protection.png

I have revised the graphic based on input from Andrew Gray and others.

http://myrandomnode.dyndns.org:8080/~gmaxwell/flagged_protection3.png

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Updating strings for FlaggedRevs for the Flagged Protection/Pending Revisions/Double Check launch

2010-05-22 Thread Gregory Maxwell
On Sat, May 22, 2010 at 8:17 PM, Rob Lanphier ro...@robla.net wrote:
 I suppose in this case, there might be a simpler debate about which is a
 better word: sighted, checked or accepted, since I think we actually
 have the same goal here (we don't want to convey anything other than
 someone other than an anonymous user gave this a once-over and thought it
 was ok to display).

"Accepted" might imply that revisions without that flag are not accepted.

This isn't actually the case.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] VP8 freed!

2010-05-20 Thread Gregory Maxwell
This is pretty far off topic, but letting FUD sit around is never a good idea.

On Thu, May 20, 2010 at 2:08 AM, Hay (Husky) hus...@gmail.com wrote:
 http://x264dev.multimedia.cx/?p=377

 Apparently the codec itself isn't as good as H264, and patent problems
 are still likely. It's better than Theora though.

You should have seen what VP3 was like when it was handed over to
Xiph.Org.  The software was horribly buggy, slow, and the quality was
fairly poor (at least compared to the current status).

Jason's comparison isn't unfair, but you need to understand it for what
it is— he's comparing a very raw, hardly-out-of-development set of
tools to his own project, which is the most sophisticated and mature
video encoder in existence.  x264 contains a multitude of pure
encoder-side techniques which can substantially improve quality and
which could be equally applied to VP8.  For an example of the kinds of
pure encoder-side improvements available, take a look at the most
recent improvements to Theora:
http://people.xiph.org/~xiphmont/demo/theora/demo9.html

Even given that, VP8's performance compared to _baseline profile_
H.264 is good. Jason describes it as "relatively close to x264's
Baseline Profile."  Baseline profile H.264 is all you can use if you
actually want to be compatible with a great many devices, including
the iPhone.

There are half-research codecs that encode and decode at minutes per
frame and simply blow away all of this stuff. VP8 is more
computationally complex than Theora, but roughly comparable to H.264
baseline. And it compares pretty favourably with H.264 baseline, even
without an encoder that doesn't suck.  This is all pretty good news.

On the patent part— simply being similar to something doesn't imply
patent infringement; Jason is talking out of his rear on that point.
He has no particular expertise with patents, and even fairly little
knowledge of the specific H.264 patents, as his project ignores them
entirely.  Codec patents are, in general, excruciatingly specific — it
makes passing the examination much easier and doesn't at all reduce
the patent's ability to cover the intended format, because the format
mandates the exact behaviour.  This usually makes them easy to avoid.
It's easy to say that VP8 has increased patent exposure compared to
Theora simply by virtue of its extreme newness (while Theora is old
enough to itself be prior art against most of the H.264 pool), but
I'd expect any problems to be in areas _unlike_ H.264 because the
similar areas would have received the most intense scrutiny. ... and
in any case, Google is putting their billion-dollar butt on the line—
litigation involving inducement to infringe on top of their own
violation could be enormous in the extreme.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Vector skin not working on BlackBerry?

2010-05-13 Thread Gregory Maxwell
On Thu, May 13, 2010 at 3:16 PM, David Gerard dger...@gmail.com wrote:
 There's a few comments on the Wikimedia blog saying they can't access
 en:wp any more using their BlackBerry. Though we tried it here on an
 8900 and it works. Any other reports?

Punching in http://en.wikipedia.org/  as I normally would...

It starts to render, but with an enormous grey area at the top like a
gigantic banner ad.  Then the browser crashes, I assume; I've never
seen it do that before... it throws up a "there was a problem
rendering this page" message, blanks the screen, and goes unresponsive.

Blackberry 8310, software v4.5.0.110 (Platform 2.7.0.90)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Broken videos

2010-03-16 Thread Gregory Maxwell
On Tue, Mar 16, 2010 at 7:15 AM, Lars Aronsson l...@aronsson.se wrote:
 So how do I tell what's wrong? I have a laptop
 that is less than half a year old, a clean
 Ubuntu Linux 9.10 install and the included
 Firefox 3.5.8 browser. This should work, but
 these two videos never play more than two seconds
 and after a while my CPU fan spins up, firefox
 runs 100%, and all I can do is a kill -9,
 which kills any other work I had going in other
 browser windows and tabs.

You aren't running in a virtual machine, are you?  Linux+VM is known as
a source of playback problems for Firefox:
https://bugzilla.mozilla.org/show_bug.cgi?id=526080

Otherwise, it's pretty likely you're hitting
https://bugzilla.mozilla.org/show_bug.cgi?id=496147 (or another one of
several closely related Linux-audio-specific bugs which are fixed to
various degrees in the latest Firefox development builds).   I believe
that disabling PulseAudio will work around this collection of issues
on Ubuntu.


On Tue, Mar 16, 2010 at 6:52 AM, Tei oscar.vi...@gmail.com wrote:
 Uh..  buffer overflow errors, complex file format loaders  in
 programming languages like C Or false assumptions about memory
 management with poor detection error and fatal consecuences.  Maybe
 even bad program intercomunication.  ...
 The internet was built on text based protocols to avoid these problems
 or help debug then.

Ironic that you say that... the variable-length null-terminated string
is probably the worst thing to ever happen to computer security.
Text does imply a degree of transparency, but it's not a security
cure-all.

In any case, video and audio are in the same boat as JPEG/PNG, +/-
some differences in software maturity.  There aren't any known or
expected malware vectors for them.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Broken videos

2010-03-16 Thread Gregory Maxwell
On Tue, Mar 16, 2010 at 12:27 PM, Tei oscar.vi...@gmail.com wrote:
 In any case, video and audio are in the same boat as Jpeg/png, +/-
 some differences in software maturity.  There aren't any known or
 expected malware vectors for them.
 Agreed. But seems possible to generate streams of video that crash the
 browser.  So.. probably autoplay is evil. (is already evil because is
 NSFW since distract coworkers )

Pegging the CPU on a fairly uncommon platform with a copy of Firefox
which is soon to be outdated is probably not an enormous worry.
Growing pains. Of course, it's useful to submit bug reports on this
stuff where they don't already exist.

If you encounter files that break Firefox, Opera, Chrome, or
Safari (+XiphQT), please let me know and I'll make sure that a bug gets
reported.  I'm also happy to fix Cortado (the Java fallback for
clients without proper video support) bugs — but Wikimedia is using a
copy of Cortado so enormously old that it's not unlikely that any
problems encountered have already been fixed.

In any case, none of the video on Wikimedia sites is autoplay in the
sense that it starts on its own.  The video tag itself is set to
autoplay, but the tag doesn't get inserted into the page until the
user clicks.  No video surprises.  (Unfortunately this process doesn't
give the video tag any chance to pre-buffer the video.)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] modernizing mediawiki

2010-03-03 Thread Gregory Maxwell
On Tue, Mar 2, 2010 at 11:30 PM, Chris Lewis yecheondigi...@yahoo.com wrote:
 I hope I am emailing this to the right group. My concern was about mediawiki 
 and it's limitations, as well as it's outdated methods. As someone wo runs a 
 wiki, I've gone through a lot of frustrations.

 If Wordpress is like Windows 7, then Mediawiki is Windows 2000. Very
 outdated GUI,

There are many, many, many skins available.

 outdated ways of doing things,
 for example using ftp to edit the settings of the wiki instead of having a

FTP ??!?   No. It's just a file.   Configuration files are considered
pretty reasonable and reliable by a lot of people. ::shrugs::


In any case…  It's Free Software, submit patches.


Cheers.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [MediaWiki-CVS] SVN: [62223] trunk/extensions/OggHandler

2010-02-10 Thread Gregory Maxwell
On Wed, Feb 10, 2010 at 8:38 AM, Tim Starling tstarl...@wikimedia.org wrote:
 That sounds like it needs a one-line fix in
 OggHandler::normaliseParams(), not 50 lines of code and a new decoder.
 Do you have a test file or a bug report or something?

Just switching the thumbnailer should be sufficient; I agree the pile
of code and the retries were fairly lame (and I think I complained
about it on IRC). I'm not sure why any support for thumbnailing ogv's
with ffmpeg was retained.

I don't see how you can fix it in the normaliseParams() call unless
you've scanned the stream and know where the keyframes are.  FFmpeg
could be fixed, of course, but the Ogg demuxer basically needs a
rewrite... I think the patch you did to FFmpeg a while back was a lot
better than the code they ultimately included.

Here is a file that won't thumbnail under the current code:
http://myrandomnode.dyndns.org:8080/~gmaxwell/theora/only_one_keyframe.ogv


ffmpeg -y -ss 5 -i only_one_keyframe.ogv -f mjpeg -an -vframes 1 foo.jpeg
throws a pile of errors, and then foo.jpeg is a zero-byte file.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [MediaWiki-CVS] SVN: [62223] trunk/extensions/OggHandler

2010-02-10 Thread Gregory Maxwell
On Wed, Feb 10, 2010 at 9:36 PM, Tim Starling tstarl...@wikimedia.org wrote:
 Gregory Maxwell wrote:
 Looks like this change removed both the Oggthumb support as well as
 the code that handles the cases where ffmpeg fails.

 The usual problem with deploying new solutions for equivalent tasks is
 that you substitute known issues with unknown ones.

 I've looked at oggThumb now, I downloaded the latest tarball. Here are
 some of the ways in which it sucks:

So a couple of months back mdale suggested using oggThumb, as the old
installed ffmpeg was making spaghetti of some files (old ffmpeg not
completely implementing the Theora spec) and the new one that
(whoever) tried installing made spaghetti in a different way (failing
to thumbnail because ffmpeg didn't take your seeking patch eons ago).

I'd never heard of it, went to look, and recoiled in horror. Then I
sent a patch.

 * Unlike the current version of FFmpeg, it does not implement
 bisection seeking. It scans the entire input file to find the relevant
 frames. For an 85MB test file, it was 30 times slower than FFmpeg.

Of the issues I raised, seeking was the only one I didn't fix.
Unfortunately oggvideotools reimplements libogg in C++ so it could use
C++ memory management; my patience ran out before I got around to
implementing it.

If you search the archive you can see how strongly opposed I am to
tools that linear-scan unnecessarily. But 30x slower on a file that
small sounds a bit odd.

 * The output filename cannot be specified on the command line, it is
 generated from the input filename. OggHandler uses a -n option for
 destination path which just gives an error for me. I don't know if
 it's a patch or an alpha version feature, but it's not documented
 either way.

It's in SVN.

After the author of the package applied my patches (on the same day I
sent them), Mdale asked if he should delay Wikimedia deployment until
the fixes I sent went in; the author offered to simply do a new
release. No one took him up on the offer.


 * It unconditionally writes a progress message to stdout on every
 frame in the input file.

 * It unconditionally pollutes stderr with verbose stream metadata
 information.

 * It ignores error return values from libtheora functions like
 th_decode_packetin(), meaning that essentially the *only* thing on
 stdout/stderr is useless noise.

I'm also not especially keen on its rather non-Unixy style. Then
again, I think C++ is pretty much crap too, so you can see what my
opinion is worth.  What I can say, speaking from personal experience,
is that the author of this package is friendly, pleasant to work
with, and responsive.  Though 'submit patches' takes me out of the
'one-line fix' I advertised — sorry, I'd assumed that Mdale had
already worked out the operational angles and my only concerns were
correct output and not allowing it to be an enormous DoS vector.


Cheers.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [MediaWiki-CVS] SVN: [62223] trunk/extensions/OggHandler

2010-02-09 Thread Gregory Maxwell
On Wed, Feb 10, 2010 at 12:51 AM,  tstarl...@svn.wikimedia.org wrote:
 http://www.mediawiki.org/wiki/Special:Code/MediaWiki/62223

 Revision: 62223
 Author:   tstarling
 Date:     2010-02-10 05:51:56 + (Wed, 10 Feb 2010)

 Log Message:
 ---
 * In preparation for deployment, revert the bulk of Michael's unreviewed 
 work. Time for review has run out. The code has many obvious problems with 
 it. Comparing against r38714 will give you an idea of which changes I am 
 accepting. Fixes bug 22388.
 * Removed magic word hook, doesn't do anything useful.
 * OggPlayer.js still needs some work.

Looks like this change removed both the oggThumb support and the
code that handles the cases where ffmpeg fails.

FFmpeg will fail to generate a thumb if there is no keyframe in the
file after the point in time at which you requested a thumb. This was
causing a failure to generate thumbs for many files, because they are
short and only have a single keyframe at the beginning.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Theora video in IE? Use Silverlight!

2010-02-05 Thread Gregory Maxwell
On Fri, Feb 5, 2010 at 3:47 PM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 On Fri, Feb 5, 2010 at 3:39 PM, David Gerard dger...@gmail.com wrote:
 This is clever-ish:

 http://www.atoker.com/blog/2010/02/04/html5-theora-video-codec-for-silverlight/

 He says there that this will Just Work on ~40% of Windows boxes. Not bad.

 Cortado works wherever Java is installed, which is probably quite a
 lot more machines -- including Safari on Mac, for instance.  If we
 used anything non-Java, it would surely be Flash, which has much
 greater penetration than Silverlight on all platforms.

Yes, Cortado works in more places, but there is no reason that BOTH
can't be used, extending support to places with Silverlight but
without Java.

Additionally, although Cortado will work on the Java ~1.1 VM that came
with Navigator 4... it's rather slow except in the latest JVMs. I
expect that a lot of systems with Silverlight are not running an
especially modern JVM.

Flash isn't in the running because you still need to be using
encumbered media formats to use it... unless you're only playing
audio: there are several independent Vorbis implementations for the
Flash virtual machine, but no video codecs yet, and sadly the Flash
architecture is nowhere near as nice as the Silverlight one for
remote-loaded codecs, so you have to completely reinvent all the media
infrastructure.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Theora video in IE? Use Silverlight!

2010-02-05 Thread Gregory Maxwell
On Fri, Feb 5, 2010 at 4:58 PM, David Gerard dger...@gmail.com wrote:
 On 5 February 2010 21:53, Gregory Maxwell gmaxw...@gmail.com wrote:

 Yes, Cortado works in more places but there is no reason that BOTH
 can't be used, extending support to places with silverlight but
 without Java.


 The thirty-second startup time of Java for Cortado makes it unusable,
 in my experience. Here's to Firefox 3.5.

Geesh. What JVM is this?  I just stop-watched it here on
http://myrandomnode.dyndns.org:8080/~gmaxwell/cortest/cortest1.html
and timed a bit over 3 seconds... fresh browser reload, no prior Java
applets run, a random 1.6GHz x86_64 laptop, and whatever JVM Fedora 12
shipped with.

But yes, I'd hope and expect the Silverlight stuff to load faster.

 Indeed. What's the performance of the Flash ActiveScript Theora
 decoder like? Horrible, or just bad?

I'm guessing you meant Vorbis, as there is no Theora port. I've not
benchmarked it, but it's supposedly a significant multiple of
realtime; I think "significant" is something like 10x, which
doesn't bode well for a video codec implementation.

The testing I did with the C-to-Flash compiler on another audio codec
convinced me that it could be made to work... though the performance
may not ultimately be satisfactory (e.g. it may only work acceptably
on fast computers, at low resolutions, etc.), although the Flash VM
might be a lot faster by the time it's done. I think it is somewhat
moot to speculate on it when it doesn't exist and, as far as I know,
no one is actively working on it.

In other news, there is some progress being made on an installable
native-code video tag for IE.
(http://cristianadam.blogspot.com/2010/01/ie-tag.html; there should be
some more news on this in a few days.)

On Fri, Feb 5, 2010 at 5:03 PM, Gerard Meijssen
gerard.meijs...@gmail.com wrote:
 Hoi,
 Providing support for Silverlight means that it needs to be tested tp ensure
 that the support remains stable. Silverlight does not really add value as
 far as I understand it. It competes with more open standards so reasons can
 be easily found not to support it. We have to invest in supporting
 Silverlight, the question is, how does it help us, our readers.

 We have a reputation that we support open standards ... so how open is
 Silverlight ?

David's post isn't about supporting Silverlight; it's about (ab)using
it to shim in support for open formats for IE users.

The current video infrastructure supports a half-dozen different modes
of playback. Maintaining one more would be work, but I think it would
have decent value, especially compared to some of the ones already
there (VLC plugin? oy).

As far as openness goes, see http://en.wikipedia.org/wiki/Novell_Moonlight

But I think it's quite reasonable to have different expectations for a
technology used as an openness shim. For example, using Flash
normally has the effect of promoting a proprietary web, but if you use
Flash only as a canvas replacement for IE users it has a neutral or
even the opposite long-term effect.

To the best of my ability to tell, Silverlight is in a much stronger
openness position than Flash is, for whatever that's worth. Microsoft
has been rather giving and inclusive in this particular bid for world
domination. ;)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Flattening a wikimedia category

2010-02-04 Thread Gregory Maxwell
On Thu, Feb 4, 2010 at 6:40 PM, Tim Landscheidt t...@tim-landscheidt.de wrote:
 Is there any reason not to have a flatted structure some-
 where on the toolserver (or, in the long run, in MediaWiki)?
 A quick look at recentchanges for dewp shows about
 22000 changes per month, about one every two minutes. With
 about 8 categories in all, it should be feasible to up-
 date the structure incrementally, with daily/weekly/monthly
 clean new full dumps (or even dispense with up-to-the-se-
 cond data and just dump the flat structure hourly).

Incremental updates for a 'flattened copy' aren't especially
realistic... as one user operation can produce millions of operations
on the server.

I won't bother saying much more; Daniel Schwen pretty much speaks for my view.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Facebook introducing PHP compiler?

2010-02-02 Thread Gregory Maxwell
On Tue, Feb 2, 2010 at 1:22 PM, Tei oscar.vi...@gmail.com wrote:
 I was thinking about that the other day, I understand why MediaWiki
 don't follow that route.


MediaWiki often runs in environments where users have no shell access,
no ability to install extensions, etc.

There is some C++ stuff for MediaWiki, such as wikidiff3, but it's optional.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Google phases out support for IE6

2010-02-01 Thread Gregory Maxwell
On Mon, Feb 1, 2010 at 1:28 PM, David Gerard dger...@gmail.com wrote:
 On 1 February 2010 15:43, Aryeh Gregor simetrical+wikil...@gmail.com wrote:
 On Mon, Feb 1, 2010 at 10:14 AM, Thomas Dalton thomas.dal...@gmail.com 
 wrote:

 It's not just the clutter, though, it's the effort of maintaining it.

 I don't suggest we maintain it.  Just leave it alone.  If other
 changes happen to cause IE5 to break, then remove it, but don't remove
 *existing* IE5 support as long as IE5 still happens to work with no
 extra effort on our part.


 Yes. If someone actually notices something bitrotting and they tell
 us, that's excellent. If they don't, there you go.

 That said, there must be *someone* on this list bloody-minded enough
 to test Wikipedia in every possible browser and file bugs and patches
 accordingly ...

It shouldn't be a question of bloody-mindedness.  The rotting of
support for a single browser version could potentially shut out many
tens of thousands of users.  It's something worth dedicating some
resources to.

Simply verifying functionality with all the *popular* browsers and
platforms is already burdensome. Doing it well (and consistently)
requires some infrastructure, such as a collection of virtualized
client machines. Once that kind of infrastructure is in place and well
oiled, the marginal cost of adding a few more test cases should not be
especially great.

The core of Wikipedia functionality is plain text with a smattering of
images in common formats. I can think of no reason that this basic
reading functionality for IE 5.x and the like should go away for the
foreseeable future, but if nothing else, knowing that it doesn't work
would be a good thing.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Google phases out support for IE6

2010-02-01 Thread Gregory Maxwell
On Mon, Feb 1, 2010 at 6:31 PM, Schneelocke schneelo...@gmail.com wrote:
 Maybe we should do the same - introduce bugs that will cause subtle
 breakages on browsers we'd rather not go out of our way to
 specifically support any longer, and see if anyone'll actually
 complain. :)

People are really bad at complaining, especially web users.  We've had
prolonged, obvious glitches which must have affected hundreds of
thousands of people, and maybe we got a couple of reports.

Users appear to just hit the back button and move on; either they
don't care at all, or they do care but assume it will be fixed without
their intervention.

What you propose is not a good policy, at least not in this application space.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Google phases out support for IE6

2010-01-31 Thread Gregory Maxwell
On Sun, Jan 31, 2010 at 6:34 PM, John Vandenberg jay...@gmail.com wrote:
 Even then, there is
 http://www.askvg.com/download-mozilla-firefox-30-portable-edition-no-installation-needed/

 Excuse me?  please read the earlier posts in this thread.

 I am talking about IE for Mac Classic.

 iCab support?  Is Classilla a sensible replacement for people still
 using IE for Mac?  etc.

I couldn't get Classilla running on a blue and white G3 running 9.0.2
when I tried it a couple of months ago.

I have a couple of these systems for driving some embedded hardware
that never got moved to anything more modern; they'd be perfectly
adequate systems for web browsing if you could get a workably
up-to-date web browser on them: the IE the OS ships with hard-locks the
machine on apple.com, of all places!  I was only bothering to attempt
this because I wanted to get a screenshot of Cortado playing videos on
something very old, and I only spent an hour or so on it. (Wikipedia,
OTOH, worked fine with the IE that comes with the OS on those systems.)


But seriously: outright *excluding* these old things shouldn't even be
a consideration. Even a very small audience (like 0.02%) is tens of
thousands of readers. MediaWiki (and the WMF deployment) already has
many features which don't work, or don't work well, on fairly old
systems, so that bridge has already been crossed, but outright
dropping support for basic use?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Gregory Maxwell
On Thu, Jan 14, 2010 at 10:47 AM, Magnus Manske
magnusman...@googlemail.com wrote:
 Suggestion :
 * log search and SHA1 IP hash (anonymous!)

*Any* mapping of the IP is not anonymous. Please see the AOL search
results, where unique IDs were connected between searches to disclose
information.   (Moreover, a straight simple hash of an IP can be
reversed simply by building a table of all expected IPs.)
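
To illustrate why (this is a throwaway sketch, not anything proposed
for deployment): the IPv4 space is small enough that the entire
"anonymous" hash mapping can simply be precomputed and joined back to
real addresses.

import hashlib

def sha1_ip(ip: str) -> str:
    return hashlib.sha1(ip.encode()).hexdigest()

# Precompute hashes for one /16 here; the full IPv4 space (~4.3 billion
# hashes) is the same loop and is cheap on commodity hardware.
table = {sha1_ip(f"192.0.{c}.{d}"): f"192.0.{c}.{d}"
         for c in range(256) for d in range(256)}

logged_hash = sha1_ip("192.0.2.1")   # what an "anonymized" log would store
print(table[logged_hash])            # recovers 192.0.2.1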

However: Since this is just for internal logging there is no need to
hash the IP.  Just log it directly, and thus avoid the risk that
someone later will think the hash is something which can be disclosed.


 * search queries are logged in a standardized fashion (for grouping),
 e.g. lowercase, single spaces, no leading/trailing spaces, special
 chars converted to spaces, etc.

Excellent.

 * display searches per week (?) that have been searched for at least
 10 times from at least 5 different IP hashes (to avoid people
 searching their own name 100 times...)

What I've suggested elsewhere was at least 4 different IPs; 5 sounds
fine to me too.  I don't know that the minimum of 10 queries matters
once the 5-IP check is in place.

Per week would be okay. No shorter, though.


If someone gives me a log format, I'll gladly write a fast tool for
producing this output.
(I did something like that before, when I gave Brion a tool to produce
stats from access logs.)

I think I have C code for a parser for Wikimedia's squid logs... so
if it's just that, I already have a good chunk of it done.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Gregory Maxwell
On Thu, Jan 14, 2010 at 11:01 AM, David Gerard dger...@gmail.com wrote:
 2010/1/14 Bryan Tong Minh bryan.tongm...@gmail.com:
 On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske
 magnusman...@googlemail.com wrote:

 * log search and SHA1 IP hash (anonymous!)

 There are only 2 billion unique addresses and they can all be found in
 half an hour probably.


 A count of search terms, with no IP info at all? Would be more useful
 than nothing.

 (modulo the issue Michael Snow raised re: searches on suppressable names)

Magnus was not suggesting disclosing the IP hash, as far as I can
tell. He was demonstrating an abundance of caution in suggesting only
logging that. (Er, well, yeah, if he was suggesting disclosing that...
we shouldn't do that.  Even if we add a secret to the hash, it's risky
and allows interesting correlation attacks.)


Here is what I would suggest disclosing:

#start_datetime end_datetime hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics


Which has first been filtered by:
* Canonicalization of strings (at least ascii case folding)
* Excluding strings over some length
* Excluding searches which did not come from at least 5 distinct IPs
during the reporting interval



There will be useful information excluded by this process, e.g. gads
of misspellings which came from only two to four unique IPs... but the
output would still be *far* more useful than no information at all.
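
As a rough sketch of that whole pipeline (the record format, length
cap, and thresholds here are illustrative assumptions, not an existing
tool):

from collections import defaultdict

MAX_LEN = 100          # drop over-long strings
MIN_DISTINCT_IPS = 5   # disclosure threshold

def disclosable_counts(records):
    """records: iterable of (ip, search_string) pairs for one interval."""
    hits = defaultdict(int)
    ips = defaultdict(set)
    for ip, query in records:
        # Canonicalize: case-fold and collapse whitespace.
        q = " ".join(query.split()).lower()
        if not q or len(q) > MAX_LEN:
            continue
        hits[q] += 1
        ips[q].add(ip)
    # Publish only strings searched from enough distinct IPs.
    return {q: count for q, count in hits.items()
            if len(ips[q]) >= MIN_DISTINCT_IPS}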

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Gregory Maxwell
On Thu, Jan 14, 2010 at 11:15 AM, Gregory Maxwell gmaxw...@gmail.com wrote:
 Here is what I would suggest disclosing:
 #start_datetime end_datetime hits search_string
 2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
 2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
 ...
 2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics

The logs are probably combined across wikis, so I'd change that to

#start_datetime end_datetime projectcode hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 5 autoerotic quantum
chromodynamics
2010-01-01-0:0:4 2010-01-13-23-59-50 de.wikipedia 25093 Bondage 
Disziplin Pokémon
...
...
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikinews 5 ethics in journalism

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Gregory Maxwell
On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin
conrad.ir...@googlemail.com wrote:
 Wiktionary is case-sensitive and so case-folding there may not be
 appropriate; I personally would be interested in seeing these logs
 before even the NFC normalizers get to them (given a lack of any other
 source to find out how people type fun characters in the wild) though I
 can appreciate this is somewhat sadistic, and probably the logs are
 taken too late for this.

 It would not be too much work to publish a set of post-processing
 scripts that could perform those normalisations that people are
 interested in; I don't think any two people will agree exactly on what

You've missed the point of the normalization here.  It's not to be
helpful to users: As you observe, it's easy for the recipient of the
list to perform their own.   The reason to normalize is to push more
queries above the reporting threshold.  For example, 5 people might
search for john f. kinndey (a misspelling of John F. Kennedy?) but
all capitalize it differently. A redirect on this misspelling would be
useful regardless of the case.

All things equal I'd rather *not* normalize the data... it's just more
stuff that may have surprising behaviour. But I think this is
something which may need to be balanced against the disclosure
threshold.

It would also be possible to do the disclosure calculation against
normalized data while releasing the raw values... but I must admit a
little bit of uneasiness that the normalization might be ignoring some
piece of information relevant to privacy.

For example, if we were to go that route we might employ some fairly
aggressive normalization... removing all whitespace and punctuation.
If we went as far as also removing all *numbers* from the check we'd
run into things like "Greg Maxwell (555)-555-1212" getting published
because enough distinct people searched for "greg maxwell".  Obviously
the answer to that one is "don't remove numbers from the check", but I
worry about the cases I haven't thought of.
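
If we did go the release-the-raw-values route, the check itself might
look something like this (sketch only; normalize() is the aggressive
folding described above, and it deliberately keeps digits for exactly
the reason given):

import re
from collections import defaultdict

def normalize(q):
    # fold case, strip whitespace and punctuation, but keep digits
    return re.sub(r"[^a-z0-9]", "", q.lower())

def releasable(records, min_ips=5):
    ips = defaultdict(set)   # normalized key -> distinct IPs seen
    raw = defaultdict(set)   # normalized key -> raw variants seen
    for ip, query in records:
        key = normalize(query)
        ips[key].add(ip)
        raw[key].add(query)
    for key in ips:
        if len(ips[key]) >= min_ips:
            yield sorted(raw[key])   # raw variants cleared for release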

On Thu, Jan 14, 2010 at 12:51 PM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 Some people might search for their own name more than five times in a
 week, possibly together with other embarrassing or incriminating
 search terms.

Yes, it's possible that someone may search 5 times, from 5 IPs (which
*might* all be one machine behind round-robin proxies), for an identical
string ... "MyFullName seen on Friday night with a woman other than
his wife" ... but what to do?

Any information which is disclosed has some risk of disclosing
something that someone would rather not be. This risk can be made
arbitrarily small, but it can't be eliminated.

I think the benefit to the readers of having this information
available easily outweighs some sufficiently fringe confidentiality
concern.  At some point your frequently repeated search is a
statistic, which no reasonable privacy policy would frown on
disclosing.

This is important to our operations, disclosing it is in the public
interest, and failing to do work in this area puts us at a
disadvantage compared to other parties who might be far less
scrupulous.  (e.g. If WMF's search performs poorly, you might feel
compelled to use Search Engine X — which happens to secretly sell your
data to the highest bidder.)

Is there some sufficiently high number which *no one* paying attention
here has a concern about?  We could simply start with that and
possibly lower the threshold over time as the lowest hanging fruit are
solved, tracking our disclosure comfort.

I think we all have an interest in and an obligation to take every
reasonable measure, but no one can ask for more than that.

Would anyone feel more comfortable if this ignored queries made via
the secure server?  Non-HTTPS traffic can be watched by anyone on the
path between you and Wikimedia... any illusion of absolute privacy on
the insecure traffic is patently false already.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Gregory Maxwell
On Thu, Jan 14, 2010 at 6:32 PM, Platonides platoni...@gmail.com wrote:
 Sampled search logs are unlikely to reveal them though, since what they
 are repeating are the non-keywords, not the full query.

Sampling is fine, but aggregated logs aren't likely to… that's the
primary reason for reporting things other than the topmost queries.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Wikimedia crosses 10Gbit/sec

2010-01-11 Thread Gregory Maxwell
Today Wikimedia's world-wide five-minute-average transmission rate
crossed 10gbit/sec for the first time ever, as far as I know. This
peak rate was achieved while serving roughly 91,725 requests per
second.

This fantastic news is almost coincident with Wikipedia's 9th
anniversary on January 15th.
[http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Day ]

In casual units, a rate of 10gbit/sec is roughly equivalent to 5 of
the US Library of Congress per day (using the common 1 LoC = 20 TiB
units).  Wikimedia's 24 hour average transmission rate is now over
5.4gbit/sec, or 2.6 US LoC/day.
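
For anyone who wants to check the casual units, the arithmetic works
out roughly like this (quick sketch, decimal TB):

tb_per_day = 10e9 / 8 * 86400 / 1e12    # 10 gbit/sec ~= 108 TB/day
loc_per_day = tb_per_day / 22           # 20 TiB ~= 22 TB, so ~4.9 LoC/day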

A snapshot of the traffic graph on this historic day can be seen here:
http://commons.wikimedia.org/wiki/File:2010-01-11_wikimedia_crosses_10gbit.png

Ten years ago many traditional information sources were turning
electronic, and possibly locking out the unlimited use previously
enjoyed by public libraries. It seemed to me that closed pay-per-use
electronic databases would soon dominate all other sources of factual
information. At the same time, the public seemed to be losing much of
its interest in the more intellectually active activities such as
reading.  So if someone told me then that within the decade one of the
most popular websites in the world would be a free content
encyclopedia, consisting primarily of text, or that the world would
soon be consuming over 50 terabytes of compressed educational material
per day—I never would have believed them.

The growth and success of the Wikimedia projects is an amazing
accomplishment, both for the staff and volunteers keeping the
infrastructure operating efficiently as well as the tens of thousands
of volunteers contributing this amazing corpus.  This success affirms
the importance of intellectual endeavours in our daily lives and
demonstrates the awesome power of people working together towards a
common goal.

Congratulations to you all.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] IE8 Compatibility View

2010-01-11 Thread Gregory Maxwell
On Mon, Jan 11, 2010 at 7:18 PM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 On Mon, Jan 11, 2010 at 6:36 PM, Mike.lifeguard
 mike.lifegu...@gmail.com wrote:
 Microsoft has informed us with an email to OTRS (#201000039819) that
 wikimedia.org (and presumably our other domains) will be removed from

 Why would you presume that?

 the Compatibility View List for Internet Explorer 8 near the end of
 January 2010.

 I don't know why we were ever on it.  We always marked our IE7 fixes
 with if IE 7 and not if IE gt 7, right?  IE8 should have been
 getting good CSS2.1 from the get-go.


http://www.microsoft.com/downloads/details.aspx?familyid=B885E621-91B7-432D-8175-A745B87D2588&displaylang=en

There is an XLS file here indicating that wikimedia.org is pending
removal, but the other domains are not. (The email appears to be
wikimedia.org specific).

Someone should probably take all the WMF domains on that list and
request that they all be removed.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] search ranking

2010-01-10 Thread Gregory Maxwell
On Sun, Jan 10, 2010 at 5:50 PM, Robert Stojnic rainma...@gmail.com wrote:
 So we got some new search servers (thanks MarkRob) and I have deployed
 them today. As a consequence, the search limit is now re-raised to 500
 and interwiki search is back on all wikis. I would still however like to
 keep srmax on 50 for API because there seems to be quite a number of
 broken bots and people experimenting...

 Additionally, I've switched mwsuggest to lucene backend, so now the AJAX
 suggestions are no longer alphabetical but ranked according to number of
 links to them (and some CamelCase and such redirects are not shown).
 This has been active on en.wp for a while, but now it's on all wikis.

 If you see things broken please find me on IRC, or leave a message on my
 en.wp talk page.

If anyone feels adventurous:

http://www.joachims.org/publications/joachims_02c.pdf

http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] search ranking

2010-01-10 Thread Gregory Maxwell
On Sun, Jan 10, 2010 at 9:52 PM, William Pietri will...@scissor.com wrote:
 On 01/10/2010 06:12 PM, Gregory Maxwell wrote:
 If anyone feels adventurous:

 http://www.joachims.org/publications/joachims_02c.pdf

 http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html


 Ooh, that looks fun. If I wanted to investigate, I'd start here, yes?

 http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/

 Is the click data available, too?

It's not— but progress on this subject would probably be a good
justification for making some available.

Without the click data available, I'd suggest simply using the
stats.grok.se page view data: It won't allow the system to learn how
preferences change as a function of query text, but it would let you
try out all the machinery.

I'd expect that static page popularity would be the obvious fill-in
data you'd use where click through information is not available, in
any case. So, for example, if query X returns A,B,C,D,E and you
only know the user clicked B, then you can assume B > [A,C,D,E]; but
by mixing in the static popularity you could also decide that
B > D > E > A > C (because D,E,A,C is the popularity order of the remaining pages).
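
As a sketch of how those pairwise preferences could be generated
(Python, untested; the popularity numbers here would come from the
stats.grok.se page view data):

def preference_pairs(results, clicked, popularity):
    # The clicked page beats everything it was shown alongside.
    pairs = [(clicked, r) for r in results if r != clicked]
    # Order the remaining results among themselves by static page views.
    rest = sorted((r for r in results if r != clicked),
                  key=lambda r: popularity.get(r, 0), reverse=True)
    pairs += [(a, b) for i, a in enumerate(rest) for b in rest[i + 1:]]
    return pairs

# e.g. results A..E, click on B, popularity order D,E,A,C gives
# B>A, B>C, B>D, B>E plus D>E, D>A, D>C, E>A, E>C, A>C.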


In order to use this kind of predictive modelling you need to create
some feature extraction. Basically, you take your input and convert it
into a feature-vector: a multidimensional value which represents the
input as a finite set of floating point numbers which (hopefully)
exposes relevant information and ignores irrelevant information.

I've never used rank-svm before, but for text classification with SVM
it is pretty common to use the presence of words to construct a sparse
vector. E.g. after stripping out markup every input word (or word
pair, or word fragment, or...) gets assigned a dimension. The vector
for a text has the value 1.0 in that dimension if the text contains the
word, 0 if it doesn't.

So, "the blue cat" might be [14:1.0 258:1.0 982:1.0], presuming that
"the" was assigned dimension 14, "blue" 258, and "cat" 982. The zillion other
possible dimensions are zero. Typical linear SVM classifiers work
reasonably well on highly sparse data like this, even if there are
hundreds of thousands of dimensions.
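
A sketch of that word-presence vectorization (Python; the dimension
numbers are whatever the word-to-id table happens to hand out, and the
output is the sparse feature:value notation that svm_light-style tools
consume, features sorted by dimension):

def to_sparse(text, word_dim):
    # map a text to a sparse binary feature vector
    dims = set()
    for w in text.lower().split():
        if w not in word_dim:                 # assign a fresh dimension id
            word_dim[w] = max(word_dim.values(), default=0) + 1
        dims.add(word_dim[w])
    return " ".join("%d:1.0" % d for d in sorted(dims))

word_dim = {"the": 14, "blue": 258, "cat": 982}
print(to_sparse("the blue cat", word_dim))    # 14:1.0 258:1.0 982:1.0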

Full text indexers like lucene also do basically the same kind of thing
internally, usually after some folding/stemming (i.e.
[girls,gals,dames,female,lady,girl,womens] -> women) and elimination
of common words (e.g. "the"), so the lucene tools may already be doing
most or all of the work you'd need for basic feature extraction.

It looks like for this rank SVM I'd run the feature-extraction on both
the query and the article and combine them into one vector for the
SVM.  For example, you could do something like assign a different
value for the word dimension (i.e. 2 if a word is in both vectors, -1
if it's in the query but not the article, 0.5 if it's only in the
article... etc.), or give query-words different dimension values than
article words (i.e. if you're tracking 100,000 words, add 100,000 to
the query word dimension numbers). I have no clue which of the
infinitely many possible ways would work best; there may be some suggestions
in the literature, but there is no replacement for simply trying a lot
of approaches.
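
For instance, the offset variant might look like this (sketch; which of
these schemes actually ranks best is exactly the "try a lot of
approaches" part):

VOCAB = 100000   # number of tracked words

def combined_features(query_words, article_words, word_dim):
    feats = {}
    for w in article_words:
        d = word_dim.get(w)
        if d is not None:
            feats[d] = 1.0              # article words live in dims 1..VOCAB
    for w in query_words:
        d = word_dim.get(w)
        if d is not None:
            feats[VOCAB + d] = 1.0      # same word, but as a query feature
    return " ".join("%d:%.1f" % (d, v) for d, v in sorted(feats.items()))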

95% of the magic in making machine learning work well is coming up
with good feature extraction.  For Wikipedia data, in addition to the
word-existence metric which is often used for free text, the presence
of categories (i.e. each category mapped to a dimension number) and
link structure information (perhaps different values for words which
are linked, or only using wikilinked words as the article keys) are
obvious things which could be added.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Gregory Maxwell
On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote:
 I am not sure about the cost of the bandwidth, but the wikipedia image dumps 
 are no longer available on the wikipedia dump anyway.  I am guessing they 
 were removed partly because of the bandwidth cost, or else image licensing 
 issues perhaps.

 I think we just don't have infrastructure set up to dump images.  I'm
 very sure bandwidth is not an issue -- the number of people with a

Correct. The space wasn't available for the required intermediate cop(y|ies).

 terabyte (or is it more?) handy that they want to download a Wikipedia
 image dump to will be vanishingly small compared to normal users.

s/terabyte/several terabytes/  My copy is not up to date, but it's not
smaller than 4.

 Licensing wouldn't be an issue for Commons, at least, as long as it's
 easy to link the images up to their license pages.  (I imagine it
 would technically violate some licenses, but no one would probably
 worry about it.)

We also dump the licensing information. If we can lawfully put the
images on the website then we can also distribute them in dump form. There
is and can be no licensing problem.

 Wikipedia uses an average of multiple gigabits per second of
 bandwidth, as I recall.

http://www.nedworks.org/~mark/reqstats/trafficstats-daily.png

Though only this part is paid for:
http://www.nedworks.org/~mark/reqstats/transitstats-daily.png

The rest is peering, etc. which is only paid for in the form of
equipment, port fees, and operational costs.

 The sensible bandwidth-saving way to do it would be to set up an rsync
 daemon on the image servers, and let people use that.

This was how I maintained a running mirror for a considerable time.

Unfortunately the process broke when WMF ran out of space and needed
to switch servers.

On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote:
 Bittorrent is simply a more efficient method to distribute files,

No. In a very real absolute sense bittorrent is considerably less
efficient than other means.

Bittorrent moves more of the outbound traffic to the edges of the
network where the real cost per gbit/sec is much greater than at major
datacenters, because a megabit on a low speed link is more costly than
a megabit on a high speed link and a megabit on 1 mile of fiber is
more expensive than a megabit on 10 feet of fiber.

Moreover, bittorrent is topology unaware, so the path length tends to
approach the internet's mean path length. Datacenters tend to be
more centrally located topology wise, and topology aware distribution
is easily applied to centralized stores. (E.g. WMF satisfies requests
from Europe in europe, though not for the dump downloads as there
simply isn't enough traffic to justify it)

Bittorrent also is a more complicated, higher overhead service which
requires more memory and more disk IO than traditional transfer
mechanisms.

There are certainly cases where bittorrent is valuable, such as the
flash mob case of a new OS release. This really isn't one of those
cases.

On Thu, Jan 7, 2010 at 11:52 AM, William Pietri will...@scissor.com wrote:
 On 01/07/2010 01:40 AM, Jamie Morken wrote:
 I have a
 suggestion for wikipedia!!  I think that the database dumps including
 the image files should be made available by a wikipedia bittorrent
 tracker so that people would be able to download the wikipedia backups
 including the images (which currently they can't do) and also so that
 wikipedia's bandwidth costs would be reduced. [...]


 Is the bandwidth used really a big problem? Bandwidth is pretty cheap
 these days, and given Wikipedia's total draw, I suspect the occasional
 dump download isn't much of a problem.

 Bittorrent's real strength is when a lot of people want to download the
 same thing at once. E.g., when a new Ubuntu release comes out. Since
 Bittorrent requires all downloaders to be uploaders, it turns the flood
 of users into a benefit. But unless somebody has stats otherwise, I'd
 guess that isn't the problem here.

We tried BT for the Commons POTY (Picture of the Year) archive once while
I was watching, and we never had a downloader stay connected long enough
to help another downloader... and that was only 500 MB, which is much easier to seed.

BT also makes the server costs a lot higher: it has more cpu/memory
overhead, and creates a lot of random disk IO.  For low volume large
files it's often not much of a win.

I haven't seen the numbers for a long time, but when I last looked
download.wikimedia.org was producing fairly little traffic... and much
of what it was producing was outside of the peak busy hour for the
sites.  Since the transit is paid for on the 95th percentile and the
WMF still has a decent day/night swing, out-of-peak traffic is
effectively free.  The bandwidth is nothing to worry about.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org

Re: [Wikitech-l] Redirect disclosure on hover

2009-12-06 Thread Gregory Maxwell
On Sun, Dec 6, 2009 at 6:56 PM, John Doe phoenixoverr...@gmail.com wrote:
 or a simpler method would be to use a javascript tool like I use which
 was created by lupin called popups which can actually get the redirect
 target page show the first picture and first paragraph on mouse hover

You have a weird definition of simpler. :) Thousands of lines of JS code
and an additional HTTP request per link isn't simple in my book. :)

Though the popups tool does provide a number of other advantages which
justify its load and complexity... are you aware of any large
MediaWiki installs which have this tool activated by default (i.e.
for anons)?


On Sun, Dec 6, 2009 at 6:45 PM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 Caching is one problem here.  Another is that you need to reliably
 generate the redirected from link somehow, so that redirects are
 maintainable.  You don't want an editor to click a link, arrive at a
 totally different page (maybe via an inappropriate redirect), and have
 no idea how they got there.

Hm. This could be resolved by mixing in the URL-stuffing alternative:
linking to /target#from_redirectname and then letting client-side JS code
generate the redirect back-link.

 The job queue is already horribly overloaded, I don't think adding
 more things to it would be a good thing.

Right… though it's not especially harmful if this information is stale so
there is the possibility of simply letting it be stale.

...but work queue load is why I waved my arms about request merging and
priority queueing.  I'd expect that the actual additional work in fixing up
redirect destination changes would be pretty negligible if the entries
were handled at a lower priority and eliminated whenever their task
was completed as a side effect of some other change.

 On the other hand, since
 this particular change doesn't affect anything visible to templates or
 such, you wouldn't have to reparse the whole page to update it, in
 principle.
[snip]

Good point.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Usability initiative (HotCatreplacement/improvements etc.)

2009-09-16 Thread Gregory Maxwell
On Wed, Sep 16, 2009 at 5:24 PM, Jared Williams
jared.willia...@ntlworld.com wrote:
 Can distribute them across multiple domain names,
 thereby bypassing the browser/HTTP limits.

 Something along the lines of
 'c'.(crc32($title) & 3).'.en.wikipedia.org'

 Would at least attempt to download up to 4 times as many things.

Right, but it reduces connection reuse. So you end up taking more TCP
handshakes and spending more time with a small transmission window (plus
more DNS round-trips; relevant because Wikimedia uses low TTLs for
GSLB reasons). TNSTAAFL.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Not allowing certain external link types?

2009-09-05 Thread Gregory Maxwell
On Sat, Sep 5, 2009 at 4:28 PM, David Gerard dger...@gmail.com wrote:
 Although his actions were IMO dickish, he has some point: is there any
 reason to allow .exe links on WMF sites? Is there a clean method to
 disable them? Is this a bad idea for any reason? What should default
 settings be in MediaWiki itself? etc., etc.

http://markmail.org/message/6zsebtdrahmwzs3s

What once was rubbish is no more? :)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] flaggedrevs.labs.wikimedia.org Status?

2009-09-01 Thread Gregory Maxwell
On Tue, Sep 1, 2009 at 5:03 AM, K. Peachey p858sn...@yahoo.com.au wrote:
 On Tue, Sep 1, 2009 at 5:39 PM, Gregory Maxwell gmaxw...@gmail.com wrote:
 Seems my concern was moot in any case... Every time I loaded it I've
 only seen trashed pages like this:
 http://flaggedrevs.labs.wikimedia.org/wiki/Super_Smash_Bros._Melee

 But I guess this is just a result of the import being incomplete.
 It's not trashed it's just missing templates and possibly css
[snip]

Um... Read the thread plz. :)

And it's different now than it was last night; last night the
templates weren't there yet and it looked like a car hit it. :)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] flaggedrevs.labs.wikimedia.org Status?

2009-09-01 Thread Gregory Maxwell
On Tue, Sep 1, 2009 at 7:17 PM, K. Peachey p858sn...@yahoo.com.au wrote:
 On Wed, Sep 2, 2009 at 7:02 AM, Platonides platoni...@gmail.com wrote:
 You know, when you point to a broken page, people^W wikipedians tend to
 do absurd things like fixing them :)
 I was going to fix some up, but import is restricted and i was too
 lazy to do copy/paste imports.

Ehhh.  I don't know that it makes sense to spend effort manually
fixing pages on a test project.  If the import procedure is not
working right it should be improved...


In any case, I'm sorry for the tangent. The main intent of my post was
to determine the current status:

Is the import finished?
When will the configuration changes for flagged protection be turned on?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] flaggedrevs.labs.wikimedia.org Status?

2009-08-31 Thread Gregory Maxwell
Greetings.

Can anyone provide a status update regarding flaggedrevs.labs.wikimedia.org ?

In the future perhaps it would be better to import the Simple English
Wikipedia for enwp testing: the lack of templates makes the site look
extensively vandalized already. I'm guessing that an alternative
English-language project would be more useful than a subset of enwp.
:)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Wikipedia iPhone app official page?

2009-08-29 Thread Gregory Maxwell
On Sat, Aug 29, 2009 at 9:07 AM, Dmitriy Sintsov ques...@rambler.ru wrote:
 Some local coder told me that GIT is slower and consumes much more RAM
 on some operations than SVN.
 I can't confirm that, though, because I never used GIT and still rarely
 use SVN. But, be warned.


I laughed at this... GIT has a number of negatives, but poor speed is
not one of them, especially if you're used to working with SVN and a
remote server.  Maybe this is just a Windows issue? GIT leaves a lot
of work to the filesystem.

My primary complaint with GIT is that if you're doing non-trivial tree
manipulation it's not at all difficult to convert your tree into Swiss
cheese, and it can be fairly difficult to fix it other than by pulling
a copy from an un-screwed-up replica and cherry-picking your later changes
back into it. OTOH, the sorts of tree uber-bonsai likely to result in
a shredded tree are pretty much not possible in SVN. YMMV.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Wikipedia iPhone app official page?

2009-08-29 Thread Gregory Maxwell
On Sat, Aug 29, 2009 at 6:48 PM, Marco
Schuster ma...@harddisk.is-a-geek.org wrote:
 And so to the disk. If the disk or the controller sucks or is simply old
 (not everyone has shiny new hardware), you're also damn slow. What should
 also not be underestimated is the diskspace demand of a GIT repo - not

On most projects I'm working on, even ones with long histories, the
git repo is around the same size as a checkout, and on many it's
smaller. Of course, you'll also need a checkout in order to do useful
work with it, but doubling the storage isn't usually a big deal.

If you're the sort of person who does development using a whole lot of
separate local trees git can use the same storage to provide history
for all of them, even when the trees are partially divergent.

DVCS is especially useful on a laptop because you can perform useful
version control while disconnected from the internet.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] please make wikimedia.org mailing lists searchable

2009-08-24 Thread Gregory Maxwell
On Mon, Aug 24, 2009 at 1:16 AM, jida...@jidanni.org wrote:
 Why have each user jump through such hoops, and still leave this door
 open to the the bad guys whoever they are.
[snip]

If you wish to have a productive discussion with people you'll be most
successful if you try to understand and empathize with their concerns,
so that you can find a solution which satisfies everyone. You won't go
far with scare-quoted phrases like "the bad guys" and hyperbole like
"held for ransom" and "North Korean style".

The current behaviour was established as the result of experience:
It's not something that was done speculatively, but as a solution to
real problems which were occurring.  Removing messages from archives
was found to be time-consuming and ineffective because once out the
removal often did nothing. The annoying of dealing with it was
magnified because it had to be done by someone with shell access and
because it was, naturally, always urgent.

People make mistakes, both the "clicked the wrong button" type and the
"failed to consider the consequences" type, and people often play fast
and loose with other people's privacy. As an example— an issue we've
had in the past is people responding with private details to a message
which included a public list buried in its carbon-copy chain.  So
admonishing "be more careful" really doesn't solve it: the lack of
google indexing is intended to address the cases where "be careful"
failed.

The intent isn't to stop people from searching for information in the
lists, which would be an impossible goal, but to prevent material from
the lists from showing up at the top of google when people perform
random searches for various people's names and to make removals
actually effective. So the availability of archive files is not a
problem.

Perhaps this is more of a problem for the Wikimedia Lists than many
others due to the high search placement of the Wiki(p|m)edia sites in
general. I think the comparison to LKML is entirely inappropriate: not
only can you make an entirely different set of assumptions about the
users' technical prowess, but LKML is open for posting to
non-subscribers … the level of spam received through it in the past
has exceeded the volume of some of our lists. It's like arguing that we
shouldn't wear underwear because the nice folks at the nudist colony
don't either. :) Different culture, different issues, different
solutions.

Other people do have the same problems and concerns— though obviously
you're less likely to see them if they aren't indexed by google!
Being able to keep your messages out of the search indexes while
remaining open to anyone who is willing to click a few buttons is a
primary attraction of the Yahoo Groups service.  Be thankful that we
don't force you through an infuriating web interface like they do.

I think everyone would like better search than we currently have
available. It should be possible to provide a solid search interface
without increasing the level of exposure.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] please make wikimedia.org mailing lists searchable

2009-08-22 Thread Gregory Maxwell
On Sat, Aug 22, 2009 at 11:20 PM, jida...@jidanni.org wrote:
 All I know is I don't know of any other examples of security through
 obscurity on mailing lists. Wasn't Jimbo inventing a new search engine?
 I don't know though... can't search for the announcement.


Download the gzipped mbox files from when you were not subscribed, for
example http://lists.wikimedia.org/pipermail/foundation-l/2009-July.txt.gz

Import this into the client software of your choice. Enjoy your
new-found ability to search.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Batik SVG-to-PNG server revisited

2009-08-16 Thread Gregory Maxwell
On Sun, Aug 16, 2009 at 8:00 PM, Hk knghk@web.de wrote:
 New test results were added at
 http://www.mediawiki.org/wiki/SVG_benchmarks

 This looks even better than my first attempt. Nonetheless, it is clear
 that batikd is not ready to use but needs to be worked on.

I'm not sure where the notion came up that median performance was a
useful criterion for selecting a rendering engine.

I'd expect that the criteria would be something like this:
0. security comfort (i.e. ability to deny local file access, strength
against overflow exploits)
1. worst case memory usage vs average
2. worst case cpu consumption vs average
3. Least surprising rendered output
4. average cpu consumption


Batik probably wins on 0, Inkscape wins on 3 (being bug compatible
with something the user can operate at home is arguably superior to
being correct), rsvg wins on 1,2,4 (and maybe daemonized batik is
getting close on 4).

Sometimes the CPU comparisons can be a bit hard... a rendering engine
which doesn't support SVG filters (i.e. old rsvg) will likely be
faster, but it will be producing unexpected output.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Video transcoding settings Was: [54611] trunk/extensions/WikiAtHome/WikiAtHome.php

2009-08-14 Thread Gregory Maxwell
On Fri, Aug 7, 2009 at 5:29 PM, d...@svn.wikimedia.org wrote:
 http://www.mediawiki.org/wiki/Special:Code/MediaWiki/54611

 Revision: 54611
 Author:   dale
 Date:     2009-08-07 21:29:26 + (Fri, 07 Aug 2009)

 Log Message:
 ---
 added a explicit keyframeInterval per gmaxwell's mention on wikitech-l. (I 
 get ffmpeg2theora: unrecognized option `--buf-delay for adding in buf-delay)


I thought firefogg was tracking j^'s nightly?  If the encoder has
two-pass it has --buf-delay. Does firefogg perhaps need to be changed
to expose it?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Video Quality for Derivatives (was Re:w...@home Extension)

2009-08-06 Thread Gregory Maxwell
On Thu, Aug 6, 2009 at 8:00 PM, Michael Dale md...@wikimedia.org wrote:
 So I committed ~basic~ derivate code support for oggHandler in r54550
 (more solid support on the way)

 Based input from the w...@home thread;  here are updated target
 qualities expressed via the firefogg api to ffmpeg2thoera

Not using two-pass on the rate controlled versions?

It's a pretty consistent performance improvement[1], and it eliminates
the blurry first-frame issue that sometimes comes up for talking
heads. (Note that by default two-pass cranks the keyframe interval to
256 and makes the buf-delay infinite. So you'll need to set those to
sane values for streaming).


[1] For example:
http://people.xiph.org/~maikmerten/plots/bbb-68s/managed/psnr.png

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Video Quality for Derivatives (was Re:w...@home Extension)

2009-08-06 Thread Gregory Maxwell
On Thu, Aug 6, 2009 at 8:17 PM, Gregory Maxwell gmaxw...@gmail.com wrote:
 On Thu, Aug 6, 2009 at 8:00 PM, Michael Dale md...@wikimedia.org wrote:
 So I committed ~basic~ derivate code support for oggHandler in r54550
 (more solid support on the way)

 Based input from the w...@home thread;  here are updated target
 qualities expressed via the firefogg api to ffmpeg2thoera

 Not using two-pass on the rate controlled versions?

 It's a pretty consistent performance improvement[1], and it eliminates
 the blurry first-frame issue that sometimes comes up for talking
 heads. (Note that by default two-pass cranks the keyframe interval to
 256 and makes the buf-delay infinite. So you'll need to set those to
 sane values for streaming).

I see r54562 switching to two-pass, but as-is this will produce files
which are not really streamable (because the streams can and will
burst to 10mbits even though the overall rate is 500kbit or whatever
is requested).

We're going to want to do something like -k 64 --buf-delay=256.

I'm not sure what key-frame interval we should be using— Longer
intervals lead to clearly better compression, with diminishing returns
over 512 or so depending on the content... but lower seeking
granularity during long spans without keyframes.  The ffmpeg2theora
defaults are 64 in one-pass mode, 256 in two-pass mode.

Buf-delay indicates the amount of buffering the stream is targeting.
I.e. For a 30fps stream at 100kbit/sec a buf-delay of 60 means that
the encoder expects that the decoder will have buffered at least
200kbit (25kbyte) of video data before playback starts.

If the buffer runs dry the playback stalls— pretty crappy for the
user's experience.  So bigger buf-delays either mean a longer
buffering time before playback or more risk of stalling.

In the above (30,60,100) example the client would require 2 seconds to
fill the buffer if they were transferring at 100kbit/sec, 1 second if
they are transferring at 200kbit/sec. etc.
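
The relationship, as a quick sketch:

def startup_seconds(fps, buf_delay_frames, target_kbps, download_kbps):
    # time for the client to fill buf-delay worth of rate-controlled video
    buffered_kbit = buf_delay_frames / float(fps) * target_kbps
    return buffered_kbit / download_kbps

startup_seconds(30, 60, 100, 100)   # 2.0 seconds before playback
startup_seconds(30, 60, 100, 200)   # 1.0 second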

The default is the same as the keyframe interval (64) in one pass
mode, and infinite in two-pass mode.  Generally you don't want the
buf-delay to be less than the keyframe interval, as quality tanks
pretty badly at that setting.

Sadly the video tag doesn't currently provide any direct way to
request a minimum buffering. Firefox just takes a guess and every time
it stalls it guesses more. Currently the guesses are pretty bad in my
experience, though this is something we'll hopefully get addressed in
future versions.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] How to securely connect to Wikipedia in a public wifi ?

2009-08-04 Thread Gregory Maxwell
On Tue, Aug 4, 2009 at 7:47 PM, Brion Vibber br...@wikimedia.org wrote:
 On 8/3/09 6:28 PM, Remember the dot wrote:
 On Mon, Aug 3, 2009 at 2:16 PM, Brion Vibber br...@wikimedia.org  wrote:
 Once we have a cleaner interface for hitting the general pages (without
 the 'secure.wikimedia.org' crappy single host)

 I'm curious...what will this cleaner interface look like? Will we be
 able to connect securely through https://en.wikipedia.org/?

 That's the idea... This means we need SSL proxies available on all of
 our front-end proxies instead of just on a dedicated location, and some
 hoop-jumping to get certificate hostnames to match, but it's not impossible.

 We did a little experimentation in '07 along these lines but just got
 busy with other things. :(


A useful data point is that greenrea...@wikifur has switched to using
protocol relative URLs rather than absolutes (i.e.
//host.domain.com/foo/bar) and had good luck with it.   This is an
additional data-point beyond the testing I did with en.wp last year.
(Last year while doing some ipv6 testing I also tested protocol
relatives and determined that all the clients with JS support were
unharmed by protocol relatives).

Ironically— the existence of secure.wikimedia.org with insecure images
is the only obstruction I see to switching images on the production
sites to protocol relatives in order to confirm client compatibility.

(For those following at home:  If Wikimedia can use protocol relatives
as a global replacement for absolutes to its own domains we can avoid
inadvertent secure/insecure mode switching and leaks without having to
have two copies of the article cache data and without kludgy
on-the-fly rewriting)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] w...@home Extension

2009-08-03 Thread Gregory Maxwell
On Mon, Aug 3, 2009 at 10:56 PM, Michael Dale md...@wikimedia.org wrote:
 Also will hack in adding derivatives to the job queue where oggHandler
 is embed in a wiki-article at a substantial lower resolution than the
 source version. Will have it send the high res version until the
 derivative is created then purge the pages to point to the new
 location. Will try and have the download link still point to the high
 res version. (we will only create one or two derivatives... also we
 should decide if we want an ultra low bitrate (200kbs or so version for
 people accessing Wikimedia on slow / developing country connections)
[snip]


So I think there should generally be three versions, a 'very low rate'
suitable for streaming for people without excellent broadband, a high
rate suitable for streaming on good broadband, and a 'download' copy
at full resolution and very high rate.  (The download copy would be
the file uploaded by the user if they uploaded an Ogg)

As a matter of principle we should try to achieve both very high
quality and something that works for as many people as possible. I don't think we
need to achieve both with one file, so the high and low rate files
could specialize in those areas.


The suitable-for-streaming versions should have a limited
instantaneous bitrate (non-infinite buf-delay). This sucks for quality
but it's needed if we want streams that don't stall, because video can
easily have 50:1 peak-to-average rates over fairly short time-spans.
(It's also part of the secret sauce that differentiates smoothly
working video from stuff that only works on uber-broadband).

Based on 'what other people do' I'd say the low should be in the
200kbit-300kbit/sec range.  Perhaps taking the high up to a megabit?

There are also a lot of very short videos on Wikipedia where the whole
thing could reasonably be buffered prior to playback.
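
As a straw man, the derivative set could be described by something like
this (purely illustrative numbers pulled from the ranges above):

DERIVATIVES = {
    # rate-controlled with a finite buf-delay so they actually stream
    "low":  {"target_kbps": 250,  "two_pass": True},
    "high": {"target_kbps": 1000, "two_pass": True},
    # the file as uploaded doubles as the full-quality download copy
    "download": None,
}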


Something I don't have an answer for is what resolutions to use. The
low should fit on mobile device screens. Normally I'd suggest setting
the size based on the content: low-motion, detail-oriented video should
get higher resolutions than high-motion scenes without important
details. Doubling the number of derivatives in order to have a large
and small setting on a per-article basis is probably not acceptable.
:(

For example— for this
(http://people.xiph.org/~greg/video/linux_conf_au_CELT_2.ogv) low
motion video 150kbit/sec results in perfectly acceptable quality at a
fairly high resolution,  while this
(http://people.xiph.org/~greg/video/crew_cif_150.ogv) high motion clip
looks like complete crap at 150kbit/sec even though it has 25% fewer
pixels. For that target rate the second clip is much more useful when
downsampled: http://people.xiph.org/~greg/video/crew_128_150.ogv  yet
if the first video were downsampled like that it would be totally
useless as you couldn't read any of the slides.   I have no clue how
to solve this.  I don't think the correct behavior could be
automatically detected and if we tried we'd just piss off the users.


As an aside— downsampled video needs some makeup sharpening, just as
downsampled stills do. I'll work on getting something in
ffmpeg2theora to do this.

There is also the option of decimating the frame-rate. Going from
30fps to 15fps can make a decent improvement for bitrate vs visual
quality but it can make some kinds of video look jerky. (Dropping the
frame rate would also be helpful for any CPU starved devices)


Something to think of when designing this is that it would be really
good to keep track of the encoder version and settings used to produce
each derivative, so that files can be regenerated when the preferred
settings change or the encoder is improved. It would also make it
possible to do quick one-pass transcodes for the rate controlled
streams and have the transcoders go back during idle time and produce
better two-pass encodes.

This brings me to an interesting point about instant gratification:
Ogg was intended from day one to be a streaming format. This has
pluses and minuses, but one thing we should take advantage of is that
it's completely valid and well supported by most software to start
playing a file *as soon* as the encoder has started writing it. (If
software can't handle this it also can't handle icecast streams).
This means that so long as the transcode process is at least realtime
the transcodes could be immediately available.   This would, however,
require that the derivative(s) be written to an accessible location.
(and you will likely have to arrange so that a content-length: is not
sent for the incomplete file).

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] GIF thumbnailing

2009-08-02 Thread Gregory Maxwell
On Sun, Aug 2, 2009 at 10:26 AM, Ilmari Karonen nos...@vyznev.net wrote:
[snip]
 It seems to me that delivering *static* thumbnails of GIF images, either
 in GIF or PNG format, would be a considerable improvement over the
 current situation.  And indeed, the code to do that seems to be already
 in place: just set $wgMaxAnimatedGifArea = 0;

So— separate from animation, why would you use a gif rather than a
PNG?  I can think of two reasons:

(1) you're making a spacer image and the gif is actually smaller,
scaling isn't relevant here
(2) you're using gif transparency and are obsessed with compatibility
with old IE. Scaling doesn't tend to work really well with binary
transparency.


In other cases the gif tends to be larger, loads slower, etc.  They
can be converted to PNG losslessly, so you should probably do so.
What am I missing?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Wiki at Home Extension

2009-08-02 Thread Gregory Maxwell
On Sun, Aug 2, 2009 at 6:29 PM, Michael Dale md...@wikimedia.org wrote:
[snip]
 two quick points.
 1) you don't have to re-upload the whole video just the sha1 or some
 sort of hash of the assigned chunk.

But each re-encoder must download the source material.

I agree that uploads aren't much of an issue.

[snip]
 other random clients that are encoding other pieces would make abuse
 very difficult... at the cost of a few small http requests after the
 encode is done, and at a cost of slightly more CPU cylces of the
 computing pool.

Is 2x slightly?  (Greater because some clients will abort/fail.)

Even that leaves open the risk that a single troublemaker will
register a few accounts and confirm their own blocks.  You can fight
that too— but it's an arms race with no end.  I have no doubt that the
problem can be made tolerably rare— but at what cost?

I don't think it's all that acceptable to significantly increase the
resources used for the operation of the site just for the sake of
pushing the capital and energy costs onto third parties, especially
when it appears that the cost to Wikimedia will not decrease (but
instead be shifted from equipment cost to bandwidth and developer
time).

[snip]
 We need to start exploring the bittorrent integration anyway to
 distribute the bandwidth cost on the distribution side. So this work
 would lead us in a good direction as well.

http://lists.wikimedia.org/pipermail/wikitech-l/2009-April/042656.html


I'm troubled that Wikimedia is suddenly so interested in all these
cost externalizations which will dramatically increase the total cost
but push those costs off onto (sometimes unwilling) third parties.

Tech spending by the Wikimedia Foundation is a fairly small portion of
the budget, enough that it has drawn some criticism.  Behaving in the
most efficient manner is laudable and the WMF has done excellently on
this front in the past.  Behaving in an inefficient manner in order to
externalize costs is, in my view, deplorable and something which
should be avoided.

Has some organizational problem arisen within Wikimedia which has made
it unreasonably difficult to obtain computing resources, but easy to
burn bandwidth and development time? I'm struggling to understand why
development-intensive externalization measures are being regarded as
first choice solutions, and invented ahead of the production
deployment of basic functionality.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] w...@home Extension

2009-08-01 Thread Gregory Maxwell
On Sat, Aug 1, 2009 at 12:13 AM, Michael Dale md...@wikimedia.org wrote:
 true... people will never upload to site without instant gratification (
 cough youtube cough ) ...

Hm? I just tried uploading to youtube and there was a video up right
away. Other sizes followed within a minute or two.

 At any rate its not replacing the firefogg  that has instant
 gratification at point of upload its ~just another option~...

As another option— Okay. But video support on the site stinks because
of lack of server side 'thumbnailing' for video.  People upload
multi-megabit videos, which is a good thing for editing, but then they
don't play well for most users.

Just doing it locally is hard— we've had failed SOC projects for this—
doing it distributed has all the local complexity and then some.

 Also I should add that this w...@home system just gives us distributed
 transcoding as a bonus side effect ... its real purpose will be to
 distribute the flattening of edited sequences. So that 1) IE users can
 view them 2) We can use effects that for the time being are too
 computationally expensive to render out in real-time in javascript 3)
 you can download and play the sequences with normal video players and 4)
 we can transclude sequences and use templates with changes propagating
 to flattened versions rendered on the w...@home distributed computer

I'm confused as to why this isn't being done locally at Wikimedia.
Creating some whole distributed thing seems to be trading off
something inexpensive (machine cycles) for something there is less
supply of— skilled developer time.  Processing power is really
inexpensive.

Some old copy of ffmpeg2theora on a single core of my Core 2 desktop
processes a 352x288 input video at around 100mbit/sec (input video
consumption rate). Surely the time and cost required to send a bunch
of source material to remote hosts is going to offset whatever benefit
this offers.

We're also creating a whole additional layer of cost in that someone
has to police the results.

Perhaps my Tyler Durden reference was too indirect:

* Create a new account
* splice some penises 30 minutes into some talking head video
* extreme lulz.

Tracking down these instances and blocking these users seems like it
would be a full-time job for a couple of people, and it would only be
made worse if the naughtiness could be targeted at particular
resolutions or fallbacks. (Making it less likely that clueful people
will see the vandalism)


 While presently many machines in the wikimedia internal server cluster
 grind away at parsing and rendering html from wiki-text the situation is
 many orders of magnitude more costly with using transclution and temples
 with video ... so its good to get this type of extension out in the wild
 and warmed up for the near future ;)

In terms of work per byte of input the wikitext parser is thousands of
times slower than the theora encoder. Go go inefficient software. As a
result the difference may be less than many would assume.

Once you factor in the ratio of video to non-video content for the
foreseeable future, this comes off looking like a time-wasting
boondoggle.

Unless the basic functionality— like downsampled videos that people
can actually play— is created I can't see there ever being a time
where some great distributed thing will do any good at all.

 The segmenting is going to significant harm compression efficiency for
 any inter-frame coded output format unless you perform a two pass
 encode with the first past on the server to do keyframe location
 detection.  Because the stream will restart at cut points.

 also true. Good thing theora-svn now supports two pass encoding :) ...

Yea, great, except doing the first pass for segmentation is pretty
similar in computational cost to simply doing a one-pass encode of
the video.

 but an extra key frame every 30 seconds properly wont hurt your
 compression efficiency too much..

It's not just about keyframe locations— if you encode separately and
then merge you lose the ability to provide continuous rate control. So
there would be large bitrate spikes at the splice intervals which will
stall streaming for anyone without significantly more bandwidth than
the clip's bitrate.

 vs the gain of having your hour long
 interview trans-code a hundred times faster than non-distributed
 conversion.  (almost instant gratification)

Well tuned, you can expect a distributed system to improve throughput
at the expense of latency.

Sending out source material to a bunch of places, having them crunch
on it on whatever slow hardware they have, then sending it back may
win on the dollars per throughput front, but I can't see that having
good latency.

 true...  You also have to log in to upload to commons  It will make
 life easier and make abuse of the system more difficult.. plus it can

Having to create an account does pretty much nothing to discourage
malicious activity.

 act as a motivation factor with distribu...@home teams, personal stats
 and 

Re: [Wikitech-l] w...@home Extension

2009-08-01 Thread Gregory Maxwell
On Sat, Aug 1, 2009 at 2:54 AM, Brian brian.min...@colorado.edu wrote:
 On Sat, Aug 1, 2009 at 12:47 AM, Gregory Maxwell gmaxw...@gmail.com wrote:
  On Sat, Aug 1, 2009 at 12:13 AM, Michael Dale md...@wikimedia.org wrote:
 Once you factor in the ratio of video to non-video content for the
 foreseeable future this comes off looking like a time-wasting
 boondoggle.
 I think you vastly underestimate the amount of video that will be uploaded.
 Michael is right in thinking big and thinking distributed. CPU cycles are
 not *that* cheap.

Really rough back-of-the-napkin numbers:

My desktop has an X3360 CPU. You can build systems all day using this
processor for $600 (I think I spent $500 on it 6 months ago).  There
are processors with better price/performance available now, but I can
benchmark on this.

Commons is getting roughly 172076 uploads per month now across all
media types: scans of single pages, photographs copied from Flickr,
audio pronunciations, videos, etc.

If everyone switched to uploading 15-minute-long SD videos instead of
other things, there would be 154,868,400 seconds of video uploaded to
Commons per month. Truly a staggering amount. Assuming a 40-hour work
week it would take over 250 people working full time just to *view*
all of it.

That number is an average rate of 58.9 seconds of video uploaded per
second every second of the month.

Using all four cores my desktop encodes video at 16x real-time (for
moderate motion standard def input using the latest theora 1.1 svn).

So you'd need fewer than four of those systems to keep up with the
entire Commons upload rate if it all switched to 15-minute videos.  Okay, it
would be slow at peak hours and you might wish to produce a couple of
versions at different resolutions, so multiply that by a couple.

This is what I meant by processing being cheap.

If the uploads were all compressed at a bitrate of 4mbit/sec, and
users were kind enough to spread their uploads out through the day, and
the distributed system were perfectly efficient (only needing to
send one copy of the upload out), and Wikimedia were only paying
$10/mbit/sec/month for transit out of their primary datacenter... we'd
find that the bandwidth costs of sending that source material out
again would be $2356/month. (58.9 seconds per second * 4mbit/sec *
$10/mbit/sec/month)

(Since transit billing is on the 95th-percentile five-minute average of
the greater of inbound or outbound traffic, uploads are basically free, but
sending out data to the 'cloud' costs like anything else.)

So under these assumptions sending out compressed video for
re-encoding is likely to cost roughly as much *each month* as the
hardware for local transcoding. ... and the pace of processor speed-ups
seems to be significantly better than the declining prices for
bandwidth.

This is also what I meant by processing being cheap.
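
For anyone who wants to poke at the napkin, the arithmetic above is
roughly this (Python sketch, same assumptions: 15-minute uploads, 16x
realtime encoding per box, 4mbit/sec sources, $10/mbit/sec/month
transit):

uploads_per_month = 172076
seconds_per_month = 30.44 * 86400              # average month

video_seconds = uploads_per_month * 15 * 60    # 154,868,400 s of video/month
rate = video_seconds / seconds_per_month       # ~58.9 s of video per second

encode_boxes = rate / 16                       # ~3.7 machines at 16x realtime
viewers = video_seconds / 3600 / (40 * 4.345)  # ~250 people just to watch it

bandwidth_usd = rate * 4 * 10                  # ~$2356/month to ship sources out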

Because uploads won't be uniformly spaced you'll need some extra
resources to keep things from getting bogged down at peak hours. But the
poor peak-to-average ratio also works against the bandwidth costs. You
can't win: unless you assume that uploads are going to be very low
bitrate, local transcoding will always be cheaper, with very short
payoff times.

I don't know how to figure out how much it would 'cost' to have human
contributors spot embedded penises snuck into transcodes and then
figure out which of several contributing transcoders are doing it and
blocking them, only to have the bad user switch IPs and begin again.
... but it seems impossibly expensive even though it's not an actual
dollar cost.


 There is a lot of free video out there and as soon as we
 have a stable system in place wikimedians are going to have a heyday
 uploading it to Commons.

I'm not saying that there won't be video; I'm saying there won't be
video if development time is spent on fanciful features rather than
desperately needed short-term functionality.  We have tens of
thousands of videos, many of which don't stream well for most people
because they need thumbnailing.

Firefogg was useful upload lubrication. But user-powered cloud
transcoding?  I believe the analysis I provided above demonstrates
that resources would be better applied elsewhere.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] w...@home Extension

2009-08-01 Thread Gregory Maxwell
On Sat, Aug 1, 2009 at 12:17 PM, Brian brian.min...@colorado.edu wrote:
 A reasonable estimate would require knowledge of how much free video can be
 automatically acquired, it's metadata automatically parsed and then
 automatically uploaded to commons. I am aware of some massive archives of
 free content video. Current estimates based on images do not necessarily
 apply to video, especially as we are just entering a video-aware era of the
 internet. At any rate, while Gerard's estimate is a bit optimistic in my
 view, it seems realistic for the near term.

So—  The plan is that we'll lose money on every transaction but we'll
make it up in volume?

(Again, this time without math: the rate of increase, as a function of
video-minutes, of the amortized hardware costs for local
transcoding is lower than the rate of increase in bandwidth costs
needed to send off the source material to users to transcode in a
distributed manner. This holds for pretty much any reasonable source
bitrate, though I used 4mbit/sec in my calculation.  So regardless of
the amount of video being uploaded, using users is simply more
expensive than doing it locally.)

Existing distributed computing projects work because the ratio of
CPU-crunching to communicating is enormously high. This isn't (and
shouldn't be) true for video transcoding.

They also work because there is little reward for tampering with the
system. I don't think this is true for our transcoding. There are many
who would be greatly gratified by splicing penises into streams, far
more so than anonymously and undetectably making a protein fold wrong.

... and it's only reasonable to expect the cost gap to widen.

On Sat, Aug 1, 2009 at 9:57 AM, David Gerard dger...@gmail.com wrote:
 Oh hell yes. If I could just upload any AVI or MPEG4 straight off a
 camera, you bet I would. Just imagine what people who've never heard
 the word Theora will do.

Sweet! Except, *instead* of developing the ability to upload straight
off a camera, what is being developed is user-distributed video
transcoding— which won't do anything by itself to make uploading
easier.

What it will do is waste precious development cycles maintaining an
overly complicated software infrastructure, waste precious commons
administration cycles hunting subtle and confusing sources of
vandalism, and waste income from donors by spending more on additional
outbound bandwidth than would be spent on computing resources to
transcode locally.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] w...@home Extension

2009-08-01 Thread Gregory Maxwell
On Sat, Aug 1, 2009 at 1:13 PM, Brianbrian.min...@colorado.edu wrote:

 There are always tradeoffs. If I understand w...@home correctly it is also
 intended to be run @foundation. It works just as well for distributing
 transcoding over the foundation cluster as it does for distributing it to
 disparate clients.

There is nothing in the source code that suggests that.

It currently requires the compute nodes to be running the Firefogg
browser extension.  So this would require loading an X server and
Firefox onto the servers in order to have them participate as it stands
now.  The video data has to take a round trip through PHP and the
upload interface, which doesn't really make any sense; that alone could
well take as much time as the actual transcode.

As a server distribution infrastructure it would be an inefficient one.

Much of the code in the extension appears to be there to handle issues
that simply wouldn't exist in the local transcoding case.   I would
have no objection to a transcoding system designed for local operation
with some consideration made for adding externally distributed
operation in the future if it ever made sense.

Incidentally— The slice and recombine approach using oggCat in
WikiAtHome produces files with gaps in the granpos numbering and audio
desync for me.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Alternative editing interfaces using write API (was: Re: Watchlistr.com, an outside site that asks for Wikimedia passwords)

2009-07-24 Thread Gregory Maxwell
On Wed, Jul 22, 2009 at 10:05 PM, Brianna
Laugherbrianna.laug...@gmail.com wrote:
[snip]
 I can imagine someone building an alternative edit interface for a
 subset of Wikipedia content, say a WikiProject. Then the interface can
 strip away all the general crud and just provide information relevant
 to that topic area.

Sweet.

I look forward to the bright future where I can create an enhanced
AJAX edit box for MediaWiki, then throw it up with a bunch of ads and
private data collection, and avoid the pesky problem of open-sourcing
my code and contributing it back to the MediaWiki codebase in order to
get it widely used.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Do no harm

2009-07-23 Thread Gregory Maxwell
On Thu, Jul 23, 2009 at 11:07 AM, dan nessettdness...@yahoo.com wrote:
[snip]
 On the other hand, if there were regression tests for the main code and for 
 the most important extensions, I could make the change, run the regression 
 tests and see if any break. If some do, I could focus my attention on those 
 problems. I would not have to find every place the global is referenced and 
 see if the change adversely affects the logic.

This only holds if the regression test would fail as a result of the
change. This is far from a given for many changes and many common
tests.

Not to mention the practical complications— many extensions have
complicated configuration and/or external dependencies.  A 'make
test_all_extensions' target is not especially realistic.

Automated tests are good, necessary even, but they don't relieve you
of the burden of directly evaluating the impact of a broad change.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Clickjacking and CSRF

2009-07-22 Thread Gregory Maxwell
On Wed, Jul 22, 2009 at 12:54 PM, Aryeh
Gregorsimetrical+wikil...@gmail.com wrote:
 Well, in this case we're not even talking about something that would
 go into HTML 5, necessarily, it's being developed by only Mozilla
 right now.  If more important Wikimedia people than I state agreement
 with me about the importance of the feature to easy CSP deployment, I
 think that will be more useful than flaming anyone.  Or if they
 disagree, they should say so so I don't mislead the Mozilla people
 into thinking the feature needs to be added to the spec.
[snip]

This point is worth saying twice.

If some minor tweak (like a monitor-but-not-enforce mode) is necessary
and sufficient for the MediaWiki core devs to commit to using the
feature (and for Wikimedia to roll it out on Wikipedia), then that
should carry significant weight for both the implementors and the
WHATWG as a whole.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Watchlistr.com, an outside site that asks for Wikimedia passwords

2009-07-22 Thread Gregory Maxwell
On Wed, Jul 22, 2009 at 4:18 PM, David Gerarddger...@gmail.com wrote:
 Mmm. So solving this properly would require solving many of the
 various consolidated/multiple watchlist bugs in MediaWiki itself,
 then.

Hm? No. Solving *this* involves having a sysadmin determine the source
IP of the remote logins and scrambling the password of every
account which has logged in through it.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] just a note...

2009-07-11 Thread Gregory Maxwell
On Sat, Jul 11, 2009 at 6:13 PM, Domas Mituzasmidom.li...@gmail.com wrote:
 Could you elaborate on what template and why changing a single
 template should have that large an effect?

 tomorrow =)


I'm guessing something that added some categories to some very widely
used infobox or licensing templates.


Do I win?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Recommending a browser for video (was: Proposal: switch to HTML 5)

2009-07-09 Thread Gregory Maxwell
On Thu, Jul 9, 2009 at 5:23 PM, David Gerarddger...@gmail.com wrote:
 2009/7/9 Platonides platoni...@gmail.com:

 I advocate a simply: You can [[install X]] to get native support. [[More
 info]]


 What do we do for iPhone users? They do not have Theora support
 because Apple has actively decided it will not support it; we can
 either appear to be defective, or we can correctly assign
 responsibility. I assume Apple is not ashamed of their decision to
 exclude Theora.

Obviously the solution is to send the user to instructions on how to
jailbreak their iphone and install theora support.  Duh.

;)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Recommending a browser for video (was: Proposal: switch to HTML 5)

2009-07-09 Thread Gregory Maxwell
On Thu, Jul 9, 2009 at 6:20 PM, David Gerarddger...@gmail.com wrote:
 2009/7/9 Aryeh Gregor simetrical+wikil...@gmail.com:

 Assuming that native support really is noticeably better.  Maybe we
 could only suggest it if we detect that the playback is stuttering, or
 suggest it more prominently if we detect that.  I assume Cortado can
 detect that.  Are there noticeable advantages to native playback other
 than better performance?


 Yes: not waiting thirty seconds for Java to start up.

10 of which your browser spends appearing to have crashed, in many cases.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] secure slower and slower

2009-07-08 Thread Gregory Maxwell
On Wed, Jul 8, 2009 at 9:05 AM, David Gerarddger...@gmail.com wrote:
 2009/7/7 Aryeh Gregor simetrical+wikil...@gmail.com:

 But really -- have there been *any* confirmed incidents of MITMing an
 Internet connection in, say, the past decade?  Real malicious attacks
 in the wild, not proof-of-concepts or white-hat experimentation?  I'd
 imagine so, but for all people emphasize SSL, I can't think of any
 specific case I've heard of, ever.  It's not something normal people
 need to worry much about, least of all for Wikipedia.

 Nope. The SSL threat model is completely arse-backwards. It assumes
 secure endpoints and a vulnerable network. Whereas what we see in
 practice is Trojaned endpoints and no-one much bothering with the
 network.

Actually, there is a lot of screwing with the network.

For instance, take the UK service providers surreptitiously modifying
Wikipedia's responses on the fly to create a fake 404 when you hit
particular articles.

I believe it's a common practice for US service providers to sell
information feeds about users' browsing data ('believe' because I know
it's done, but I don't have concrete information about how common it
is). Your use of Wikipedia likely has less privacy than your use of a
public library.

SSL kills these attacks dead.

People who try to read via Tor to avoid the above-mentioned problems
subject themselves to naughty activities by unscrupulous exit
operators. MITM activities by Tor exit operators are common and well
documented.  SSL would remove some of the incentive to use Tor (since
your local network/ISP could no longer spy on you if you used SSL) and
would remove most of Tor's grievous hazard for those who continue to
use it to read.

There are some truly nasty things you can do with an enwiki admin
account. They can be undone, sure, but a lot of damage can be done.
They are obvious enough, and have been discussed in backrooms enough
that I don't think I'll do much harm by listing a few of them:

(1) By twiddling site JS you can likely knock any target site off the
internet by scripting our clients to connect to it frequently.
Although this could be deactivated once discovered, due to
caching it would hang around for a while.  Well timed, even a short
outage could cause significant real dollar damage.

(2) You could script clients to kick users to a malware installer.
Again, it could be quickly undone, but a lot of damage could be caused
with only a few minutes of script placement. Generally you could use
WP as a nice launching ground for any kind of XSS vulnerability that
you're already aware of.

Any of these JS attacks could be enhanced by only making them
effective for anons, reducing their visibility, and by making the JS
modify the display of the Mediawiki: pages to both hide the bad JS
from users and to make it impossible to remove without disabling
client JS.  Provided your changes didn't break the site, I'd take a
bet that you could have a malware installer running for days before it
was discovered.

(3) You could rapidly merge page histories for large numbers of
articles, converting their histories into jumbled messes.  I don't
believe we yet have any automated solution to fix that beyond restoring
the site from backups.

(4) Any admin account can be used to capture bureaucrat and/or
checkuser access by injecting user JS for one of these users and using
it to steal their session cookie (unless the change to SUL stopped
this, but I don't see how it could have; even if so you could
remote-pilot them). With checkuser access you can quickly dump out
decent amounts of private data. The leak of private data can never be
undone.  (Or, alternatively, you can just MITM a real steward,
checkuser, or bureaucrat (say, at Wikimania or a wiki meetup :) ) and
get their access directly.)


These are just a few things… I'm sure if you think creatively you can
come up with more.  The use of SSL makes attacks harder and some types
of attack effectively impossible. It should be considered important.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Proposal: switch to HTML 5

2009-07-08 Thread Gregory Maxwell
On Wed, Jul 8, 2009 at 2:23 PM, Michael Dalemd...@wikimedia.org wrote:
 The current language is For best video playback experience we recommend
 _Firefox 3.5_ ... but I am open to adjustments.

I'd drop the word experience. It's superfluous marketing speak.

So the notice chain I'm planning on adding to the simple video/
compatibility JS is something like this:

If the user is using Safari 4 on a desktop system and doesn't have XiphQT:
* Advise the user to install XiphQT (note, there should be a good
installer available soon)

The rationale being that if they are known to use Safari now they
probably will be in the future; better to get them to install XiphQT
than to hope they'll continue using another browser.

If the user is using any of a list of platforms known to support Firefox:
* Advise them to use firefox 3.5

Otherwise say nothing.
It would be silly at this time to be advising users of some
non-firefox-supporting mobile device that firefox 3.5 provides the
best experience. ;)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Recommending a browser for video (was: Proposal: switch to HTML 5)

2009-07-08 Thread Gregory Maxwell
On Wed, Jul 8, 2009 at 2:56 PM, Aryeh
Gregorsimetrical+wikil...@gmail.com wrote:
 On Wed, Jul 8, 2009 at 2:43 AM, Marco
 Schusterma...@harddisk.is-a-geek.org wrote:
 We should not recommend Chrome - as good as it is, but it has serious
 privacy problems.
 Opera is not Open Source, so I think we'd best stay with Firefox, even if
 Chrome/Opera begin to support video tag.

 I don't think we should use these kinds of ideological criteria when
 making any sort of recommendation here.  We should state in a purely
 neutral fashion that browsers X, Y, and Z will result in the video
 playing better on your computer than your current browser does.  It
 would be misleading to imply that Firefox is superior to these other
 browsers for the purposes of playing the video tag.

Not every decision is a purely technical one. Mozilla has done a lot to
support the development of this functionality. Putting other browser
developers on equal footing is not a neutral decision either.

The ideological, and other, criteria are moot when there is only one
thing to recommend.

 On Wed, Jul 8, 2009 at 2:42 PM, Gregory Maxwellgmaxw...@gmail.com wrote:
 That sounds good.  Why not recommend Safari plus XiphQT as well, if
 the goal is only to tell them what browsers support good video
 playback?

Hm. Two things to install rather than one?

For the moment there is also a technical problem with Safari 4: It
claims (via the canPlayType() call) that it can't support Ogg even
when XiphQT is installed.  We currently work around this by detecting
the mime-type registration which happens as part of the XiphQT
installation.  In practice this means that Safari 4 will work with Ogg
video on sites using OggHandler, but not on many others.
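
The workaround looks roughly like this; the exact mime type strings to
probe are my assumption here, not a quote of the OggHandler code:

  // Safari 4 workaround sketch: ignore canPlayType()'s false negative
  // and look for the mime type registration that the XiphQT install
  // adds. The exact type strings probed here are an assumption.
  function xiphQTRegistered() {
    var types = navigator.mimeTypes;
    if (!types) return false;
    return !!(types['application/ogg'] || types['video/ogg'] ||
              types['audio/ogg']);
  }

  function canPlayOgg(videoElement) {
    // Trust canPlayType() when it says yes; fall back to the plugin
    // registration check when it (wrongly) says no.
    if (videoElement.canPlayType &&
        videoElement.canPlayType('video/ogg; codecs="theora, vorbis"')) {
      return true;
    }
    return xiphQTRegistered();
  }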

Safari also isn't an especially widely adopted browser outside of
Apple systems. Should we also recommend the dozens of oddball free
Gecko- and WebKit-based browsers supporting the video tag which are
soon to exist?   Flooding users with options is a good way to turn them
off. There is already at least one (Midori).

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Proposal: switch to HTML 5

2009-07-08 Thread Gregory Maxwell
On Wed, Jul 8, 2009 at 3:06 PM, David Gerarddger...@gmail.com wrote:
 2009/7/8  j...@v2v.cc:
 David Gerard wrote:

 You are using Internet Explorer. Install the Ogg codecs _here_ for a
 greatly improved Wikimedia experience.

 Internet Explorer does not support the video tag, installing Ogg
 DirectShow filters does not help there.


 Yes, I realised this just after sending my email :-)

 I presume, though, there's some way of playing videos in IE. Is there
 a way to tell if the Ogg filters are installed?

Java or via the VLC plugin

At least Safari + XiphQT has the benefit of working as well as
Firefox 3.5 does. The same is not true for Java or VLC.  (The VLC
plugin is reported to cause many browser crashes; Java is slow to
launch and somewhat CPU hungry.)

I've suggested making the same installer for XiphQT for win32 also
install the XiphDS plugins, which would make things easier on users.
But XiphDS does not help with in-browser playback today.


Since, at the moment, Firefox is the only non-beta browser with direct
support, I don't see why plugging Firefox would be controversial. It's
a matter of fact that it works best with Firefox 3.5 or Safari+XiphQT.
Later when there are several options things will be a little more
complicated.  Certainly I don't think any recommendation should be
made when the user already has native-grade playback.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Recommending a browser for video (was: Proposal: switch to HTML 5)

2009-07-08 Thread Gregory Maxwell
On Wed, Jul 8, 2009 at 6:12 PM, David Gerarddger...@gmail.com wrote:
 2009/7/8 Aryeh Gregor simetrical+wikil...@gmail.com:
 On Wed, Jul 8, 2009 at 4:27 PM, David Gerarddger...@gmail.com wrote:

 Uh, it's not a good option for Wikimedia video.

 With XiphQT, why not?  Maybe not ideal, but surely good.


 As Greg has noted, due to a bug in Safari it's impossible for the
 browser at present to indicate that it can handle Ogg or not.

 So how do we tell if the Safari user can use that or if they have to
 download XiphQT? There isn't a way at present. Either we shove Safari
 on Mac users onto Cortado by default (since Java can be presumed
 present on MacOS X) or we risk giving them a video element that
 doesn't work.

 (Unless the failure can somehow be sniffed.)

Well *we* do. As a side effect of installing XiphQT a mime type is
registered.  This is completely independent of the video tag.  So
we'll detect this and use it anyway.

I believe we're the only users of video who have ever done this. It's
not obvious, and I doubt we'd be doing it were it not for the fact
that that detection method was previously used for detecting pre-video
availability of XiphQT.

(FWIW, that behaviour is now fixed in their development builds)

Regardless, I think we've finished the technical part of this
decision— the details are a matter of organizational concern now, not
technology.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Proposal: switch to HTML 5

2009-07-07 Thread Gregory Maxwell
On Tue, Jul 7, 2009 at 1:54 AM, Aryeh
Gregorsimetrical+wikil...@gmail.com wrote:
[snip]
 * We could support video/audio on conformant user agents without
 the use of JavaScript.  There's no reason we should need JS for
 Firefox 3.5, Chrome 3, etc.


Of course, that could be done without switching the rest of the site to HTML5...

Although I'm not sure that giving the actual video tags is desirable.
It's a tradeoff:

Working for those users when JS is disabled, and correctly handling
saving of the full page including the videos, vs. taking more traffic
from clients doing range requests to generate the poster image, and
potentially traffic from clients which decide to go ahead and fetch the
whole video without the user asking for it.

There is also still a bug in FF3.5 where the built-in video
controls do not work when JS is fully disabled (because the controls
are written in JS themselves).


(To be clear to other people reading this: the MediaWiki OggHandler
extension already uses HTML5 and works fine with Firefox 3.5, etc., but
this only works if you have JavaScript enabled.  The site could
instead embed the video elements directly, and only use JS to
substitute fallbacks when it detects that the video tag can't be used.)
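
A minimal sketch of that approach, with the fallback substitution done
in JS only where it's needed; the Cortado applet markup here is
illustrative, not the actual OggHandler embed code:

  // Embed <video> directly in the page; run this only to swap in a
  // fallback player where the tag can't be used. The applet details
  // below are illustrative, not the real Cortado embed code.
  function addVideoFallbacks() {
    var videos = document.getElementsByTagName('video');
    for (var i = videos.length - 1; i >= 0; i--) {  // backwards: live collection
      var v = videos[i];
      if (v.canPlayType &&
          v.canPlayType('video/ogg; codecs="theora, vorbis"')) {
        continue;                                   // native support: leave it alone
      }
      var applet = document.createElement('applet');
      applet.setAttribute('code', 'com.fluendo.player.Cortado.class');
      applet.setAttribute('archive', 'cortado.jar');
      applet.setAttribute('width', v.getAttribute('width') || 400);
      applet.setAttribute('height', v.getAttribute('height') || 300);
      var url = document.createElement('param');
      url.setAttribute('name', 'url');
      url.setAttribute('value', v.currentSrc || v.getAttribute('src'));
      applet.appendChild(url);
      v.parentNode.replaceChild(applet, v);
    }
  }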

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Proposal: switch to HTML 5

2009-07-07 Thread Gregory Maxwell
On Tue, Jul 7, 2009 at 7:53 PM, Michael Dalemd...@wikimedia.org wrote:
[snip]
 I don't really have apple machine handy to test quality of user
 experience in OSX safari with xiph-qt. But if that is on-par with
 Firefox native support we should probably link to the component install
 instructions for safari users.

I believe it's quite good. 'Believe' is the best I can offer, never
having personally tested it.  I did work with a Safari user, sending
them specific test cases designed to torture it hard (and some XiphQT
bugs were fixed in the process), and at this point it sounds pretty
good.

What I have not stressed is any of the JS API. I know it seeks, I have
no clue how well, etc.

There is also an Apple WebKit developer, friendly and helpful at
getting things fixed, whom we work with when we do encounter bugs... but
more testing is really needed.

Safari users wanted.


As far as the 'soft push' ... I'm generally not a big fan of one-shot
completely dismissible nags: Too often I click past something only to
realize shortly thereafter that I really should have clicked on it.
I'd prefer something that did a significant (alert-level) nag *once*
but perpetually included a polite Upgrade your Video button below
(above?) the fallback video window.

There is only a short period of time remaining where a singular
browser recommendation can be done fairly and neutrally. Chrome and
Opera will ship production versions and then there will be options.
Choices are bad for usability.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Regular expressions searching

2009-07-06 Thread Gregory Maxwell
On Mon, Jul 6, 2009 at 2:37 PM, Aryeh
Gregorsimetrical+wikil...@gmail.com wrote:
 On Mon, Jul 6, 2009 at 7:43 AM, Andrew Garrettagarr...@wikimedia.org wrote:
 Yes.

 We wouldn't allow direct searching from the web interface with regexes
 for two related reasons:

 1/ A single search with the most basic of regexes would take several
 minutes, if not hours. It isn't computationally trivial to search for
 a small string in a very large string of over 10 GB, let alone a
 regex. Words can be indexed, regexes cannot.

 2/ Even if we could find a way to make the former performant, a
 malicious regex could significantly expand this time taken, leading to
 a denial of service.

 I seem to recall Gregory Maxwell describing a setup that made this
 feasible, given the appropriate amount of dedicated hardware.  It was
 run with the entire database in memory; it only permitted real
 regular expressions (compilable to finite-state machines, no
 backreferences etc.); and it limited the length of the finite-state
 machine generated.  Running a regex took several minutes, but he'd run
 a lot of them in parallel, since it was mostly memory-bound, so he got
 fairly good throughput.  Something like that.

 But probably not practical without an undue amount of effort and
 hardware, yeah.  :)

Yes, I didn't comment on the initial comment because full PCRE is
simply far too much to ask for.

Basic regexps of the sort that can be compiled into a deterministic
finite state machine (i.e. no backtracking) can be merged together
into a single larger state machine.

So long as the state machine fits in cache, the entire DB can be
scanned in not much more time than it takes to read it in from memory,
even if there are hundreds of parallel regexps.

So you batch up user requests then run them in parallel groups. Good
throughput, poor latency.

Insufficiently selective queries are problematic. I never came up with
a really good solution to people feeding in patterns like '.' and
stalling the whole process by wasting a lot of memory bandwidth
updating the result set. (an obvious solution might just be to limit
the number of results)

The latency can be reduced by partitioning the database across
multiple machines (more aggregate memory bandwidth).  By doing this
you could achieve arbitrarily low latency and enormous throughput.
Dunno if it's actually worthwhile, however.
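
JavaScript's RegExp engine backtracks rather than compiling to a merged
DFA, so the following is only an illustration of the batching shape
(one pass over each page answering a whole queue of queries), with a
made-up data layout and the assumption that the submitted patterns
contain no capture groups of their own:

  // Illustration of batched scanning: wrap each queued pattern in a
  // capture group, combine with alternation, and make one pass per page.
  // Assumes the patterns themselves contain no capture groups.
  function batchSearch(pages, patterns) {
    var combined = new RegExp(
      patterns.map(function (p) { return '(' + p + ')'; }).join('|'), 'g');
    var results = patterns.map(function () { return []; });

    pages.forEach(function (page) {
      var seen = {};
      var m;
      combined.lastIndex = 0;                    // reset between pages
      while ((m = combined.exec(page.text)) !== null) {
        for (var g = 1; g < m.length; g++) {
          if (m[g] !== undefined && !seen[g]) {
            seen[g] = true;
            results[g - 1].push(page.title);     // pattern g-1 matched this page
          }
        }
        if (m.index === combined.lastIndex) combined.lastIndex++;  // zero-width guard
      }
    });
    return results;  // results[i] = titles matching patterns[i]
  }

A real implementation would compile the merged automaton once and stream
the text through it, and would need the result-set cap mentioned above
for insufficiently selective patterns like '.'.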

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] On templates and programming languages

2009-07-01 Thread Gregory Maxwell
On Wed, Jul 1, 2009 at 1:42 AM, Dmitriy Sintsovques...@rambler.ru wrote:
 XSLT itself is a way too much locked down - even simple things like
 substrings manipulation and loops aren't so easy to perform. Well, maybe
 I am too stupid for XSLT but from my experience bringing tag syntax in
 programming language make the code poorly readable and bloated. I've
 used XSLT for just one of my projects.

Juniper Networks (my day job) uses XSLT as the primary scripting
language on their routing devices, and chose to do so primarily
because of sandboxing and the ease of XML tree manipulation with xpath
(JunOS configuration has a complete and comprehensive XML
representation).  To facilitate that usage we defined an alternative
syntax for XSLT called SLAX (http://code.google.com/p/libslax/),
though it hasn't seen widespread adoption outside of Juniper yet.
(Slax can be mechanically converted to XSLT and vice versa)

SLAX pretty much resolves your readability concern, although the
conceptual barriers for people coming from procedural languages to
any strongly functional programming language still remain.

You don't loop in XSLT, you recurse or iterate over a structure (i.e.
map/reduce).

I've grown rather fond of XSLT but wouldn't personally recommend it
for this application. It lacks the high-speed bytecoded execution
environments available for other languages, and I don't see many
scripts on the site doing extensive document tree manipulation (it's
hard for me to express how awesome xpath is at that)... and I would
also guess that there are probably more adept mediawiki template
language coders today than there are people who are really fluent in
XSLT.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] On templates and programming languages

2009-07-01 Thread Gregory Maxwell
On Wed, Jul 1, 2009 at 3:50 AM, William Allen
Simpsonwilliam.allen.simp...@gmail.com wrote:
 Javascript, OMG don't go there.

Don't be so quick to dismiss JavaScript.  If we were making a scorecard,
it would likely check most of the boxes:

* Availability of reliable, battle-tested sandboxes (probably the only
option discussed, other than x-in-JVM, meeting this criterion)
* Availability of fast execution engines
* Widely known by the existing technical userbase   (JS beats the
other options hands down here)
* Already used by many Mediawiki developers
* Doesn't inflate the number of languages used in the operation of the site
* Possibility of reuse between server-executed and client-executed
(Only JS of the named options meets this criterion)
* Can easily write clear and readable code
* Modern high level language features (dynamic arrays, hash tables, etc)

There may exist great reasons why another language is a better choice,
but JS is far from the first thing that should be eliminated.

Python is a fine language but it fails all the criteria I listed above
except the last two.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] On templates and programming languages

2009-07-01 Thread Gregory Maxwell
On Wed, Jul 1, 2009 at 11:21 AM, William Allen
Simpsonwilliam.allen.simp...@gmail.com wrote:
 * Doesn't inflate the number of languages used in the operation of the site

 This is the important checkbox, as far as integration with the project (my
 first criterion), but is the server side code already running JavaScript?
 For serving pages?

No, but MediaWiki and the sites are already chock-full of client-side code in JS.

You basically can't do advanced development for MediaWiki or the
wikimedia sites without a degree of familiarity with Javascript due to
client compatibility considerations.

 My general rule: coming over the network, presume it's bad data.

In this case we're not talking about the language MediaWiki is written
in; we're talking about a language used for server-side content
automation (templates).  In that case we'd be assuming the inputs are
toxic just like in the client-side case, since everything, including
the code itself, came in over the network.

I'll concede that there likely wouldn't be much code reuse, but I'd
attribute that more to the starkly different purpose and the fact that
the server version would have a different API (no DOM, but instead
functions for pulling data out of mediawiki).


 And we have far too many examples of existing JS
 already being used in horrid templates, being promulgated in important
 areas such as large categories, that don't seem to work consistently, and
 don't work at all with JavaScript turned off.
 I run Firefox with JS off by default for all wikimedia sites, because of
 serious problems in the not so recent past!

Fortunately this is a non-issue here: better server-side scripting
enhances the site's ability to operate without requiring scripting on
the client.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Technical solution to the privileged users adding web bugs problem

2009-06-29 Thread Gregory Maxwell
Shutting Down XSS with Content Security Policy
http://blog.mozilla.com/security/2009/06/19/shutting-down-xss-with-content-security-policy/

I'm usually the first to complain about applying technical solutions
to problems which are not fundamentally technical... but this looks
like it would be reasonably expedient to implement.

While it won't be effective for all users, the detection functionality
would be a big improvement in wrangling these problems across the
hundreds of Wikimedia projects, many of which lack reasonable
oversight of their sysop activities.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Technical solution to the privileged users adding web bugs problem

2009-06-29 Thread Gregory Maxwell
On Mon, Jun 29, 2009 at 7:56 PM, Aryeh
Gregorsimetrical+wikil...@gmail.com wrote:
 I think this would be reasonable to consider implementing as soon we
 have a significant number of users using it.  It isn't a good idea to
 make CSP policies that won't actually be effective immediately for a
 lot of people, because then we'll probably use it incorrectly, break
 tons of stuff, and not even notice for months or years (possibly even
 harming uptake of the first version of Firefox to support it).
 This does seem to be Mozilla-only, though.  If it were an open
 specification that multiple vendors were committed to implementing,
 that would make it significantly more attractive.  I wonder why
 Mozilla isn't proposing this through the W3C from the get-go.

When to do it is a philosophical issue:

Arguably it should be turned on early so that the early adopters of
the technology (i.e. firefox devs!) will be test subjects. Support for
audio/video tag in Wikipedia has been helpful in the development of
firefox audio and video tag support.  If the feature is turned on only
once these clients are widely deployed then we'll have a situation
where things may be broken for many users.

So— turn it on early and have many things broken for a small
number of technically savvy users, up to the point of potentially
slowing the adoption of a future browser release... or turn it on later,
when it will likely cause a few problems but for 30% of the site's
visitors?

The latter sounds like too much of a flag-day.

The stuff likely to stay broken after the initial implementation are
things like userscripts. Those are just going to take a long time to
fix no matter what. The best thing there would be to communicate the
correct practices well in advance so that the natural development
cycle picks them up, but I'm not aware of any way to communicate such
a thing except by making the wrong ways not work.

 We'd have to do some work to get full benefit from this, since we
 currently use stuff like inline script all over the place.  But it

Right, though with all the minification interest I've seen here lately
it sounds like a great time to hoist all that stuff out of the pages.

 would be fairly trivial to use only *-src to deny any remote loading
 of content from non-approved domains, and skip the rest.  That would
 at least mitigate XSS some, but it would stop the privacy issues we've
 been having cold, as you say.

I think one really compelling thing about it is that supporting
clients can provide feedback to the webserver.  This means that every
supporting user will be an XSS test probe, a canary in the page-mine.
So even if this doesn't become standardized and widely adopted by
clients other than Firefox, it would reduce the damage of unintentional
but well-meaning privacy leaks, since we'd get notice of them very
quickly rather than months later.

Hopefully this will be more widely adopted, because I think that the
available knobs provide a level of functionality which we couldn't
achieve any other way. (i.e. we could deny html/script injection
completely in mediawiki, but limiting scripts to accessing particular
domains isn't something mediawiki could reasonably do itself)
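
For the sake of illustration, a policy of the sort being discussed
might look like the snippet below. Treat every detail as an assumption:
the directive syntax follows the spec as it later settled rather than
Mozilla's X-Content-Security-Policy prototype, and the hostnames are
just examples.

  // Illustrative only: a restrictive policy plus a report-only
  // ("monitor but not enforce") deployment path. Header names and
  // hostnames here are examples, not what Wikimedia would actually use.
  var policy =
    "default-src 'self'; " +
    "img-src 'self' upload.wikimedia.org; " +
    "script-src 'self' bits.wikimedia.org; " +
    "report-uri /csp-report";

  // e.g. with Node's http module, during the monitoring phase:
  //   res.setHeader('Content-Security-Policy-Report-Only', policy);
  // and once the violation reports look clean, switch to enforcement:
  //   res.setHeader('Content-Security-Policy', policy);

The report-only variant is exactly the feedback channel described
above: every supporting client acts as a probe without anything being
blocked for them.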


I don't know enough to comment on the W3C path— but I have no
particular reason to think it wouldn't happen: W3C activity is almost
universally lagging rather than leading. Things like this aren't
generally matters for discussion unless someone is thinking of
implementing them.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Mediawiki and html5

2009-06-27 Thread Gregory Maxwell
On Sat, Jun 27, 2009 at 4:39 AM, Strainustrain...@gmail.com wrote:
 Hi,

 I've heard that wikipedia will be among the first content providers to
 support the video and audio tags in html5. I'm trying to put up a
 presentation about the subject for a FF3.5 release party and I would
 like to find out more. Could you point me to some documents or answer
 some of the questions below?

 1) When will this support appear?
 2) Has the code already been modified accordingly?

Stephen Bain addressed this admirably, but I thought I should add that
the support has been there for years now. We've been waiting for
browser vendors to catch up.

Even prior to Opera's push for the video tag we had in-browser java
based playback of Ogg files on English Wikipedia.

 3) How much time will legacy browsers be supported?

For Wikimedia, legacy browser support is fairly inexpensive: legacy
clients play back the same files that the video/audio tag users get.
So legacy support can last as long as it's relevant.

For sites who have used other formats for legacy browsers, they have
the cost of maintaining another set of encodes and format royalties,
so for them there may be more incentive to drop legacy support.

There is also a question of what constitutes 'legacy': there is one
desktop browser that can play our video perfectly adequately using the
HTML5 tags, but it requires a codec pack.

 4) What prompted this desire to be an early adopter of this technology?

Wikimedia has a long-standing commitment to open and unencumbered file
formats which stems back to nearly the start of the projects. The
mission of the Wikimedia Foundation is to empower and engage people
around the world to collect and develop educational content under a
free license or in the public domain, and to disseminate it
effectively and globally, and it has been the belief that people are
more empowered when they don't feel forced for compatibility reasons
to use formats they have to ask permission for and pay for.

As such, the use of encumbered video technology such as flash is not
something that would be decided lightly.

The adoption of the HTML5 tags follows naturally from this
pre-existing behavior as a way of getting media working for a larger
portion of the userbase.

 5) Will other codecs except Theora be supported?

The list of file types supported today can be found here:
http://commons.wikimedia.org/wiki/Commons:File_types

Really the support just depends on the intersection of the project
requirements (as of today: free and unencumbered formats) and client
support (as of today, Ogg/Theora has the widest client compatibility
for HTML5 video).

The thumbnailing infrastructure for video currently only handles
Ogg/Theora but other formats could be easily added.

This one isn't really a technical question.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] subst'ing #if parser functions loses line breaks, and other oddities

2009-06-26 Thread Gregory Maxwell
On Fri, Jun 26, 2009 at 12:01 PM, Gerard
Meijssengerard.meijs...@gmail.com wrote:
 Hoi,
 At some stage Wikipedia was this thing that everybody can edit... I can not
 and will not edit this shit so what do you expect from the average Joe ??

I can not (effectively) contribute to
http://en.wikipedia.org/wiki/Ten_Commandments_in_Roman_Catholicism

Does this mean Wikipedia is a failure?

I don't think so.  Not everyone needs to be able to do everything.
That's one reason projects have communities: other people can do the
work which I'm not interested in or not qualified for.  Not everyone
needs to make templates— and there are some people who'd have nothing
else to do but add fart jokes to science articles if the site didn't
have plenty of template mongering that needed doing.

Unfortunately the existing system is needlessly exclusive. The
existing parser-function-based solutions are so byzantine that even
many people with the right interest and knowledge are significantly put
off by them.

The distinction between this and a general 'easy to use' goal is a very
critical one.

It's also the case that the existing system's problems spill past its
borders due to its own limitations: regular users need to deal with
things like weird whitespace handling and templates which MUST be
substed (or can't be substed; at random from the user's perspective).
This makes the system harder even for the vast majority of people who
should never need to worry about the internals of the templates.

I think this is the most important issue, and it's one with real
usability impacts, but it's not due to the poor syntax. On this
point, the template language could be INTERCAL but still leave most
users completely free to ignore the messy insides. The existing system
doesn't because there is no clear boundary between the page and the
templates (among other reasons, like the limitations of the existing
'string' manipulation functions).

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Minify

2009-06-26 Thread Gregory Maxwell
On Fri, Jun 26, 2009 at 4:33 PM, Michael Dalemd...@wikimedia.org wrote:
 I would quickly add that the script-loader / new-upload branch also
 supports minify along with associating unique id's grouping  gziping.

 So all your mediaWiki page includes are tied to their version numbers
 and can be cached forever without 304 requests by the client or _shift_
 reload to get new js.

Hm. Unique ids?

Does this mean that every page on the site must be purged from the
caches to cause all requests to see a new version number?

Is there also some pending squid patch to let it jam in a new ID
number on the fly for every request? Or have I misunderstood what this
does?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Unbreaking statistics

2009-06-05 Thread Gregory Maxwell
On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohderaro...@gmail.com wrote:
 On Fri, Jun 5, 2009 at 6:38 PM, Tim Starlingtstarl...@wikimedia.org wrote:
 Peter Gervai wrote:
 Is there a possibility to write a code which process raw squid data?
 Who do I have to bribe? :-/

 Yes it's possible. You just need to write a script that accepts a log
 stream on stdin and builds the aggregate data from it. If you want
 access to IP addresses, it needs to run on our own servers with only
 anonymised data being passed on to the public.

 http://wikitech.wikimedia.org/view/Squid_logging
 http://wikitech.wikimedia.org/view/Squid_log_format


 How much of that is really considered private?  IP addresses
 obviously, anything else?

 I'm wondering if a cheap and dirty solution (at least for the low
 traffic wikis) might be to write a script that simply scrubs the
 private information and makes the rest available for whatever
 applications people might want.

There is a lot of private data in user agents (MSIE 4.123; WINNT 4.0;
bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34 may be
uniquely identifying). There is even private data in titles if you don't
sanitize carefully
(/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box).
 There is private data in referrers
(http://rarohde.com/url_that_only_rarohde_would_have_comefrom).

Things which individually do not appear to disclose anything private
can disclose private things (look at the people uniquely identified by
AOL's 'anonymized' search data).

On the flip side, aggregation can take private things (i.e.
useragents; IP info; referrers) and convert it to non-private data:
Top user agents; top referrers; highest traffic ASNs... but becomes
potentially revealing if not done carefully: The 'top' network and
user agent info for a single obscure article in a short time window
may be information from only one or two users, not really an
aggregation.

Things like common paths through the site should be safe so long as
they are not provided with too much temporal resolution, are limited
to existing articles, and are limited to either really common paths or
paths broken into two- or three-node chains with the least common of
those withheld.
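
One mechanical safeguard that falls out of the above is to refuse to
publish any aggregate bucket with too few distinct users behind it. A
sketch, with an arbitrary threshold and a made-up record shape:

  // Sketch: only publish aggregate buckets (e.g. "top user agents for
  // article X this week") backed by enough distinct users. Both the
  // threshold and the bucket layout are made up for illustration.
  var MIN_DISTINCT_USERS = 50;

  function publishableAggregates(buckets) {
    // buckets: [{ key: 'MSIE 7.0', article: 'Foo', distinctUsers: 3, hits: 9 }, ...]
    return buckets.filter(function (b) {
      return b.distinctUsers >= MIN_DISTINCT_USERS;
    }).map(function (b) {
      return { key: b.key, article: b.article, hits: b.hits };  // drop the rest
    });
  }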

Generally when dealing with private data you must approach it with the
same attitude that a C coder must take to avoid buffer overflows.
Treat all data as hostile, assume all actions are potentially
dangerous. Try to figure out how to break it, and think deviously.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Google web bugs in Mediawiki js from admins - technical workarounds?

2009-06-04 Thread Gregory Maxwell
On Thu, Jun 4, 2009 at 10:19 AM, David Gerard dger...@gmail.com wrote:
 Keeping well-meaning admins from putting Google web bugs in the
 JavaScript is a game of whack-a-mole.

 Are there any technical workarounds feasible? If not blocking the
 loading of external sites entirely (I understand hu:wp uses a web bug
 that isn't Google), perhaps at least listing the sites somewhere
 centrally viewable?

Restrict site-wide JS and raw HTML injection to a smaller subset of
users who have been specifically schooled in these issues.


This approach is also compatible with other approaches. It has the
advantage of being simple to implement and should produce a
considerable reduction in problems regardless of the underlying cause.


Just be glad no one has yet turned English Wikipedia's readers into
their own personal DDoS drone network.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Google web bugs in Mediawiki js from admins - technical workarounds?

2009-06-04 Thread Gregory Maxwell
On Thu, Jun 4, 2009 at 10:53 AM, David Gerard dger...@gmail.com wrote:
 I understand the problem with stats before was that the stats server
 would melt under the load. Leon's old wikistats page sampled 1:1000.
 The current stats (on dammit.lt and served up nicely on
 http://stats.grok.se) are every hit, but I understand (Domas?) that it
 was quite a bit of work to get the firehose of data in such a form as
 not to melt the receiving server trying to process it.

 OK, then the problem becomes: how to set up something like
 stats.grok.se feasibly internally for all the other data gathered from
 a hit? (Modulo stuff that needs to be blanked per privacy policy.)

What exactly are people looking for that isn't available from
stats.grok.se and isn't a privacy concern?

I had assumed that people kept installing these bugs because they
wanted per-article source network breakdowns and other clear privacy
violations.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Google web bugs in Mediawiki js from admins - technical workarounds?

2009-06-04 Thread Gregory Maxwell
On Thu, Jun 4, 2009 at 11:01 AM, Mike.lifeguard
mikelifegu...@fastmail.fm wrote:
 On Thu, 2009-06-04 at 15:34 +0100, David Gerard wrote:

 Then external site loading can be blocked.


 Why do we need to block loading from all external sites? If there are
 specific  problematic ones (like google analytics) then why not block
 those?

Because:

(1) External loading results in an uncontrolled leak of private reader
and editor information to third parties, in contravention of the
privacy policy as well as basic ethical operating principles.

(1a) Most external script loading will also defeat users' choice
of SSL and leak more information about their browsing to their local
network. It may also bypass any wikipedia specific anonymization
proxies they are using to keep their reading habits private.

(2) External loading produces a runtime dependency on third party
sites. Some other site goes down and our users experience some kind of
loss of service.

(3) The availability of external loading makes Wikimedia a potential
source of very significant DDOS attacks, intentional or otherwise.

That's not to say that there aren't reasons to use remote loading, but
the potential harms mean that it should probably be a default-deny
permit-by-exception process rather than the other way around.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] flagged revisions

2009-05-20 Thread Gregory Maxwell
On Wed, May 20, 2009 at 9:58 PM, Bart banati...@gmail.com wrote:
 I don't know about those flagged revisions.  After a while, it would
 basically mean that every edit and page view would be doubled.  For most
[snip]

Sorry to be curt, but why do people who have a weak understanding of
the functionality available feel so compelled to make comments like
this?

The software supports automatically preserving the standing flagging
(or some portion of it) when users with the authority to set those
flags make edits.  This eliminates the inherent doubling.

The flagging communicates to users that a revision has been reviewed
to some degree by an established user. This should allow review
resources to be applied more effectively, rather than having 100 people
review every change to a popular article while changes to less popular
articles end up insufficiently reviewed.

Furthermore, the existence of flagged versions in the history means
that when a series of unflagged revisions are made they can be
reviewed in a single action by viewing the diff against the single
most recent 'known-probably-good' flagged revision.  Without these
points in the history every single edit must be individually reviewed.
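
As a sketch of that review shortcut (the revision objects below are a
made-up shape for illustration, not the FlaggedRevs schema):

  // Find the most recent flagged ("known-probably-good") revision and
  // diff the current head against it, covering all pending edits at once.
  function latestFlagged(revisions) {        // revisions: oldest to newest
    for (var i = revisions.length - 1; i >= 0; i--) {
      if (revisions[i].flagged) return revisions[i];
    }
    return null;                             // nothing flagged yet: review everything
  }

  function pendingChangesDiff(revisions) {
    var base = latestFlagged(revisions);
    var head = revisions[revisions.length - 1];
    if (!base || base.id === head.id) return null;  // nothing pending
    return { from: base.id, to: head.id };          // one diff covers every pending edit
  }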

The exact change in workload isn't clear: if there is an increase in
workload, it would come from reviewing changes by less-established
users (those unable to set the flags) which previously went completely
without review.  I hope that there isn't currently enough completely
unreviewed material to offset the time-saving improvements of
collaborative review and known-good comparison points.


I'm sure that it is possible to find worthwhile criticisms of the
flagging functionality (or the particular configuration requested by
EnWP), but many people have worked very hard on this functionality and
many of the most obvious possible problems have been addressed. To produce
an effective criticism you're going to need to spend a decent amount
of time researching, reading discussion history, trying the software,
etc. Maybe if you do you'll find that the functionality isn't as
frightening as you feared and hopefully you'll find a new possible
problem which can actually be addressed without rejecting this attempt
at forward progress.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dealing with Large Files when attempting a wikipedia database download - Focus upon Bittorrent List

2009-04-18 Thread Gregory Maxwell
On Fri, Apr 17, 2009 at 9:42 PM, Gregory Maxwell gmaxw...@gmail.com wrote:
[snip]
 But if you are running parallel connections to avoid slowdowns you're
 just attempting to cheat TCP congestion control and get an unfair
 share of the available bandwidth.  That kind of selfish behaviour
 fuels non-neutral behaviour and ought not be encouraged.
[snip]
On Sat, Apr 18, 2009 at 3:06 AM, Brian brian.min...@colorado.edu wrote:
 I have no problem helping someone get a faster download speech and I'm also
 not willing to fling around fallacies about how selfish behavior is bad for
 society. Here is wget vs. aget for the full history dump of the simple
[snip]

And? I did point out this is possible, and that no torrent was
required to achieve this end. Thank you for validating my point.

Since you've called my position fallacious I figure I ought to give it
a reasonable defence, although we've gone off-topic.

The use of parallel TCP has allowed you an inequitable share of the
available network capacity[1]. The parallel transport is fundamentally
less efficient as it increases the total number of congestion
drops[2]. The categorical imperative would have us not perform
activities that would be harmful if everyone undertook them. At the
limit: If everyone attempted to achieve an unequal share of capacity
by running parallel connections the internet would suffer congestion
collapse[3].

Less philosophically and more practically: the unfair usage of
capacity by parallel fetching P2P tools is a primary reason for
internet providers to engage in 'non-neutral' activities such as
blocking or throttling this P2P traffic[4][5][6].  Ironically, a
provider which treats parallel transport technologies unfairly will be
providing a fairer network service, and non-neutral handling of
traffic is the only way to prevent an (arguably unfair) redistribution
of transport towards end-user-heavy service providers.

(I highly recommend reading the material in [5] for a simple overview
of P2P fairness and network efficiency, as well as the Briscoe IETF
draft in [4] for a detailed operational perspective.)

Much of the public discussion on neutrality has focused on portraying
service providers considering or engaging in non-neutral activities as
greedy and evil. The real story is far more complicated and far less
clear cut.

Where this is on-topic is that non-neutral behaviour by service
providers may well make the Wikimedia Foundation's mission more costly
to practice in the future.  In my professional opinion I believe the
best defence against this sort of outcome available to organizations
like Wikimedia (and other large content houses) is the promotion of
equitable transfer mechanisms which avoid unduly burdening end user
providers and therefore providing an objective justification for
non-neutral behaviour.  To this end Wikimedia should not gratuitously
promote or utilize cost-shifting technology (such as P2P distribution)
or inherently unfair, inefficient transmission (parallel TCP, or a
fudged server-side initial window).

I spent a fair amount of time producing what I believe to be a well
cited reply which I believe stands well enough on its own that I
should not need to post any more in support of it. I hope that you
will at least put some thought into the issues I've raised here before
dismissing this position.  If my position is fallacious then numerous
academics and professionals in the industry are guilty of falling for
the same fallacies.


[1] Cho, S. 2006 Congestion Control Schemes for Single and Parallel
Tcp Flows in High Bandwidth-Delay Product Networks. Doctoral Thesis.
UMI Order Number: AAI3219144, Texas A&M University.
[2] Padhye, J., Firoiu, V. Towsley, D. and Kurose, J., Modeling TCP
throughput: a simple model and its empirical validation. ACM SIGCOMM,
Sept. 1998.
[3] Floyd, S., and Fall, K., Promoting the Use of End-to-End
Congestion Control in the Internet, IEEE/ACM Transactions on
Networking, Aug. 1999.
[4] B. Briscoe, T. Moncaster, L. Burness (BT),
http://tools.ietf.org/html/draft-briscoe-tsvwg-relax-fairness-01
[5] Nicholas Weaver, presentation "Bulk Data P2P: Cost Shifting, not
Cost Savings" (http://www.icsi.berkeley.edu/~nweaver/p2pi_shifting.ppt);
Nicholas Weaver, position paper, P2PI Workshop,
http://www.funchords.com/p2pi/1 p2pi-weaver.txt
[6] Bruno Tuffin, Patrick Maillé: How Many Parallel TCP Sessions to
Open: A Pricing Perspective. ICQT 2006: 2-12

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dealing with Large Files when attempting a wikipedia database download - Focus upon Bittorrent List

2009-04-17 Thread Gregory Maxwell
On Fri, Apr 17, 2009 at 6:10 PM, Chad innocentkil...@gmail.com wrote:
 I seem to remember there being a discussion about the
 torrenting issue before. In short: there's never been any
 official torrents, and the unofficial ones never got really
 popular.

Torrent isn't a very good transfer method for things which are not
fairly popular as it has a fair amount of overhead.

The Wikimedia download site should be able to saturate your internet
connection in any case…

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dealing with Large Files when attempting a wikipedia database download - Focus upon Bittorrent List

2009-04-17 Thread Gregory Maxwell
On Fri, Apr 17, 2009 at 9:21 PM, Stig Meireles Johansen
sti...@gmail.com wrote:
 But some ISP's throttle TCP-connections (either by design or by simple
 oversubscription and random packet drops), so many small connections *can*
 yield a better result for the end user. And if you are so unlucky as to
 having a crappy connection from your country to the download-site, maybe,
 just maybe someone in your own country already has downloaded it and is
 willing to share the torrent... :)
 I can saturate my little 1M ADSL-link with torrent-downloads, but forget
 about getting throughput when it comes to HTTP-requests... if it's in the
 country, in close proximity and the server is willing, then *maybe*.. but
 else.. no way.

There are plenty of downloading tools that will use range requests to
download a single file with parallel connections…

But if you are running parallel connections to avoid slowdowns you're
just attempting to cheat TCP congestion control and get an unfair
share of the available bandwidth.  That kind of selfish behaviour
fuels non-neutral behaviour and ought not be encouraged.

We offered torrents in the past for Commons Picture of the Year
results— a more popular thing to download, a much smaller file (~500 MB
vs many GB), and not something which should become outdated every
month… and pretty much no one stayed connected long enough for anyone
else to manage to pull anything from them. It was an interesting
experiment, but it indicated that further use for these sorts of files
would be a waste of time.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Large nested templates (example: NYRepresentatives)

2009-04-14 Thread Gregory Maxwell
On Tue, Apr 14, 2009 at 10:52 AM, Sergey Chernyshev
sergey.chernys...@gmail.com wrote:
 Domas,

 In this particular case, template will just contain an SMW query to get all
 representatives.
[snip]

How does this avoid merely shifting the load from the parser (on the
plentiful application servers) to the database?

Not that more intelligence isn't good: from a content-maintenance
perspective something query-based is doubtless better than some
static serialized lump. But the complaint here was performance, as far
as I can tell, and I think that's a more complicated question.


Re: [Wikitech-l] ANNOUNCE: OpenStreetMap maps will be added to Wikimedia projects

2009-04-05 Thread Gregory Maxwell
On Sun, Apr 5, 2009 at 10:12 PM, Brian brian.min...@colorado.edu wrote:
 Great. Let us know when you've got community approval.

Better than a simple super-majority too, per the precedent set in the
recent discussions related to revision flagging.



Re: [Wikitech-l] Providing simpler dump format (raw, SQL or CSV)?

2009-03-31 Thread Gregory Maxwell
On Tue, Mar 31, 2009 at 10:02 AM, Christensen, Courtney
christens...@battelle.org wrote:
 -Original Message-
 Given that the current dump process is having problem, why not provide
 a simple fix such as providing raw table format , SQL files or even
 CSV files?

 Howard,

 Can't you get the SQL files from running mysqldump from the command line?  
 Why does something new need to be created?  I hope I'm not being dense, but I 
 don't understand what new niche you are asking to fill.

Because the data (text) isn't in a single database, even for a single
project; it is spread across a large number of machines. It's also in
a mixture of bizarre internal formats.

The file format is pretty much irrelevant to the 'cost' of producing a dump.


Re: [Wikitech-l] PDF vulnerability

2009-02-20 Thread Gregory Maxwell
On Fri, Feb 20, 2009 at 12:57 PM, Platonides platoni...@gmail.com wrote:
[snip]
 It could also pass a virus scan but I don't think it's really needed.
 Virus scanners mainly look for known bad code, inside executables. We
 don't want any kind of executable.

I've run clamav against the entire set of files in the past. Found a
couple of interesting things (like, 3 files out of millions).


Converting PDF to PostScript with pdftops and back will probably
totally kill the text layer. Might as well render to images and DjVu.



Re: [Wikitech-l] Javascript localization, minify, gzip cache forever

2009-02-20 Thread Gregory Maxwell
On Fri, Feb 20, 2009 at 5:51 PM, Brion Vibber br...@wikimedia.org wrote:
[snip]
 On the other hand we don't want to delay those interactions; it's
 probably cheaper to load 15 messages in one chunk after showing the
 wizard rather than waiting until each tab click to load them 5 at a time.

 But that can be up to the individual component how to arrange its loads...

Right. It's important to keep in mind that in most cases the user is
*latency bound*. That is to say, the RTT between them and the datacenter
is the primary determining factor in the load time, not how much data is
sent.

Latency determines the connection setup time, and it also influences how
quickly the TCP receive window (rwin) can grow and get you out of
slow start. When you send more at once, you'll also be sending more of it
with a larger window.

So in terms of user experience you'll usually improve results by sending
more data if doing so is able to save you a second request.
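
As a back-of-the-envelope illustration (the numbers are assumptions: a
100 ms RTT and an initial congestion window of about four segments), a
toy model of slow start shows why one slightly larger response beats a
second round trip:

    RTT = 0.100            # seconds (assumed)
    INIT_CWND = 4 * 1460   # ~4 full-size segments, a typical initial window (assumed)

    def transfer_time(payload_bytes):
        """Time to push the payload while the window doubles each round trip."""
        sent, cwnd, rtts = 0, INIT_CWND, 0
        while sent < payload_bytes:
            sent += cwnd
            cwnd *= 2
            rtts += 1
        return rtts * RTT

    # 16 KB in one response costs the same two round trips as 15 KB...
    print(transfer_time(16_000))
    # ...while fetching the extra kilobyte later costs at least one more RTT
    # (more still if it needs a fresh connection and handshake).
    print(transfer_time(15_000) + transfer_time(1_000))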

Even ignoring the user's experience, connections aren't free. There is
byte overhead in establishing a connection, byte overhead in lost
compression from working with smaller objects, byte overhead in having
more partially filled IP packets, CPU overhead from processing more
connections, and so on.

Obviously there is a line to be drawn: you wouldn't improve performance
by sending the whole of Wikipedia on the first request. But you will
most likely not be conserving *anything* by avoiding sending another
kilobyte of compressed user interface text for an application a user has
already invoked, even if only a few percent use the additional messages.


Re: [Wikitech-l] inconsistent precision in PHP output

2009-02-11 Thread Gregory Maxwell
On Wed, Feb 11, 2009 at 12:29 PM, Robert Rohde raro...@gmail.com wrote:
 Yes Domas, haha, because no one would ever want to write about math or
 high precision scientific measurements in an encyclopedia.

Holy crud!  You don't use floating point for this!  If you need
deterministic behaviour and high accuracy you need to confine yourself
to integer mathematics.

Sure, *write about* high-precision scientific measurements in
Wikipedia, but don't use Wikipedia to *make them*.
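
For what it's worth, the usual trick is to pick a unit small enough and
carry values as plain integers, converting to a float only for display.
A minimal sketch (in Python for brevity, with a made-up measurement; the
same idea applies in PHP or template code):

    # Store the measurement as an integer count of a small enough unit
    # (picometres here) rather than as a float of metres.
    wavelength_pm = 632_991            # i.e. 632.991 nm, carried exactly

    # Integer arithmetic is exact and gives the same answer on every machine,
    # regardless of compiler flags, x87 vs. SSE, or php.ini settings.
    total_pm = 1_000 * wavelength_pm

    # Convert to a float only at the very end, for formatting.
    print(f"{total_pm} pm = {total_pm / 1_000_000_000:.6f} mm")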

[snip]
 Am I wrong in thinking that the server admins should care when
 different machines produce different output from the same code?  In
 this case, the behavior suggests it may be as simple as ensuring that
 the servers have the same php.ini precision settings.

Is there any reason to think that this is related to a PHP setting
rather than being a result of differences in compiler decisions with
respect to moving variables on and off the x87 stack and into memory, or
the use of SSE? Or some libc difference in how the FPU rounding mode
is set?

At 12 digits you are beyond the expected precision of single-precision
floating point, and not far from what you get with doubles. On x86
the delivered precision can vary wildly depending on the precise
sequence of calculations and register spills. For code compiled
without -ffast-math the former should be stable for a single piece of
code, but the latter is anyone's guess.



Re: [Wikitech-l] – Fixing {val}

2009-02-02 Thread Gregory Maxwell
On Sat, Jan 31, 2009 at 8:33 PM, Robert Rohde raro...@gmail.com wrote:
 This discussion is getting side tracked.

 The real complaint here is that

 {{#expr:(0.7 * 1000 * 1000) mod 1000}} is giving 69 when it should give 
 70.

 This is NOT a formatting issue, but rather it is bug in the #expr
 parser function, presumably caused by some kind of round-off error.

It's a bug in the user's understanding of floating point on computers,
combined with % being (quite naturally) an operator on integers.

0.7… does not exist in your computer's finite-precision base-2 arithmetic.
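
A quick way to see it (Python's Decimal prints the exact value of the
nearest binary double, and the loop shows how truncating to an integer
can land one short):

    from decimal import Decimal

    # The double closest to 0.7 sits slightly *below* 0.7, because 0.7 has an
    # infinitely repeating expansion in base 2:
    print(Decimal(0.7))

    # Any operator that truncates its operand to an integer can therefore come
    # out one lower than expected whenever the scaled value lands just under
    # the intended whole number:
    for scale in (10, 100, 1000, 1000 * 1000):
        x = 0.7 * scale
        print(scale, repr(x), int(x))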

I don't think it's reasonable for MediaWiki to include a full radix-n
multi-precision floating point library in order to capture the
behavior you expect in these cases, any more than it would be
reasonable to expect it to contain a full computer algebra system so
it could handle manipulations of irrationals precisely.


  1   2   >