Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-11-27 Thread John Erling Blad
I have no idea about the schema changes, but to choose a digest for
detection of identity reverts is pretty simple. The really difficult
part is to choose a locally sensitive hash or fingerprint that works
for very similar revisions with a lot of content.

I would propose that the digest is stored in the database, and that a
lsh or fingerprint is calculated on the fly by the API, unless someone
can find a really good way to make and store a lsh or fingerprint that
has all necessary properties.

For all the purposes I know (and care) about the digest will be used
for detection of identity reverts, while the lsh/fingerprint will be
used for resynchronization after difficult partly reverts. In addition
it seems likely that fingerprints are necessary for more fine-grained
analysis.

It seems like the necessary properties for lsh and the fingerprint
scales with increasing content, that makes it difficult to precompute
a value.

John

On Mon, Nov 28, 2011 at 2:28 AM, Tim Starling  wrote:
> On 28/11/11 08:29, Brion Vibber wrote:
>> So... this seems to have snuck back in a month ago:
>> https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021
>>
>> https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
>
> I don't think it really "snuck", Rob has been talking about it for a
> while, see e.g. comment 27.
>
>> Have we resolved the deployment questions on how to actually do the change?
>> Just want to make sure ops has plenty of warning before 1.19 comes down the
>> pipe. (Especially if we have to revert anything back to 1.18 during/after!)
>
> It can be deployed like any column addition to a large table: on the
> slaves first, then switch masters, then on the old masters. For 1.17
> we changed categorylinks (60M rows on enwiki), and that caused no
> problems. In 1.18 the schema changes were done by ops (Asher), and
> included flaggedrevs which is 30M rows on dewiki.
>
> The revision table is 320M rows on enwiki, but it doesn't pose any
> special challenges, as long as there's enough disk space. The snapshot
> host db26 is the only host which may possibly be in danger of running
> out of space, but if its snapshots are deleted and the space
> reallocated to /a then it won't have any trouble.
>
> Like the previous schema changes, this schema change will be done in
> advance of the software version change. The old version will work with
> the new schema, and the default value is harmless, so reverting back
> to 1.18 or restarting the populate script won't be a problem.
>
> -- Tim Starling
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-11-27 Thread Tim Starling
On 28/11/11 08:29, Brion Vibber wrote:
> So... this seems to have snuck back in a month ago:
> https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021
> 
> https://bugzilla.wikimedia.org/show_bug.cgi?id=21860

I don't think it really "snuck", Rob has been talking about it for a
while, see e.g. comment 27.

> Have we resolved the deployment questions on how to actually do the change?
> Just want to make sure ops has plenty of warning before 1.19 comes down the
> pipe. (Especially if we have to revert anything back to 1.18 during/after!)

It can be deployed like any column addition to a large table: on the
slaves first, then switch masters, then on the old masters. For 1.17
we changed categorylinks (60M rows on enwiki), and that caused no
problems. In 1.18 the schema changes were done by ops (Asher), and
included flaggedrevs which is 30M rows on dewiki.

The revision table is 320M rows on enwiki, but it doesn't pose any
special challenges, as long as there's enough disk space. The snapshot
host db26 is the only host which may possibly be in danger of running
out of space, but if its snapshots are deleted and the space
reallocated to /a then it won't have any trouble.

Like the previous schema changes, this schema change will be done in
advance of the software version change. The old version will work with
the new schema, and the default value is harmless, so reverting back
to 1.18 or restarting the populate script won't be a problem.

-- Tim Starling


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-11-27 Thread Brion Vibber
So... this seems to have snuck back in a month ago:
https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021

https://bugzilla.wikimedia.org/show_bug.cgi?id=21860

Have we resolved the deployment questions on how to actually do the change?
Just want to make sure ops has plenty of warning before 1.19 comes down the
pipe. (Especially if we have to revert anything back to 1.18 during/after!)

-- brion
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 7:20 PM, Anthony  wrote:
> On Sun, Sep 18, 2011 at 7:07 PM, bawolff  wrote:
>> Anthony wrote:
>> The pages you link to seem to indicate he's nothing more than a
>> willy-on-wheels type vandal, who at worst tricked an admin into doing
>> a delete of a page with a very high number of revisions making the
>> server kittens cry for a moment. There's no indication he has "mad
>> hacker skillz" in any way or form (and given the tone of that
>> Encyclopedia Dramatica page, I assume they'd be bragging about it if
>> he did).
>
> As I said, I couldn't find a page which described it in detail.  Maybe
> if you look at archive.org?

By the way, my comment about "mad hacker skillz" was meant to be
sarcastic.  The term "script kiddie" is probably more accurate.

I don't know how the person did it.  I don't know whether they were
*the* Grawp or just a copycat.  I don't know if they found a parsing
bug, or they found a backdoor through a default password, or if they
hacked my account password (*).  I even don't know if it was
javascript or style sheets or gabagool or whatever the hell.  All I
know is that he fucked up my site so bad I didn't know how to fix it
(other than restoring the database, which I didn't feel like doing).
I asked someone to take a look at the site, and he said I was attacked
by Grawp and I needed to upgrade my Mediawiki.  At that point I said
"fuck it, I'm just going to host a few pages at Knol, and just take
down the rest".

(*) I believe it was the former, though, because when I looked at the
database the page edits were made by a regular user, not by me, and
not by a special account.

And all of this is irrelevant.  Generating an MD5 collision does not
in any way involve "mad hacker skillz".

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 7:07 PM, bawolff  wrote:
> Anthony wrote:
>> It does not involve generating hash collisions, but it involves
>> finding various bugs in mediawiki and using them to vandalise, often
>> by injecting javascript.  The best description I could find was at
>> Encyclopedia Dramatica, which seems to be taken down (there's a cache
>> if you do a google search for "grawp wikipedia").  There's also a
>> description at http://en.wikipedia.org/wiki/User:Grawp , which does
>> not do justice to the "mad hacker skillz" of this individual and his
>> intent on finding bugs in mediawiki and exploiting them.
>>
>
> Say what? Being able to inject js is a very serious vulnerability. If
> he's doing this, why haven't I seen any security releases triggered by
> a vandal finding an XSS? has no one reported it?

I have no idea.  How long have you been reading the release notes?
This was a few years ago that this happened to me, and the software I
was using was probably a year or two old.

I didn't investigate into the details of the bug.  I didn't have the
time to do that, which is why I just took the site down rather than
bother.

> The pages you link to seem to indicate he's nothing more than a
> willy-on-wheels type vandal, who at worst tricked an admin into doing
> a delete of a page with a very high number of revisions making the
> server kittens cry for a moment. There's no indication he has "mad
> hacker skillz" in any way or form (and given the tone of that
> Encyclopedia Dramatica page, I assume they'd be bragging about it if
> he did).

As I said, I couldn't find a page which described it in detail.  Maybe
if you look at archive.org?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread bawolff
Anthony wrote:
> It does not involve generating hash collisions, but it involves
> finding various bugs in mediawiki and using them to vandalise, often
> by injecting javascript.  The best description I could find was at
> Encyclopedia Dramatica, which seems to be taken down (there's a cache
> if you do a google search for "grawp wikipedia").  There's also a
> description at http://en.wikipedia.org/wiki/User:Grawp , which does
> not do justice to the "mad hacker skillz" of this individual and his
> intent on finding bugs in mediawiki and exploiting them.
>

Say what? Being able to inject js is a very serious vulnerability. If
he's doing this, why haven't I seen any security releases triggered by
a vandal finding an XSS? has no one reported it?

The pages you link to seem to indicate he's nothing more than a
willy-on-wheels type vandal, who at worst tricked an admin into doing
a delete of a page with a very high number of revisions making the
server kittens cry for a moment. There's no indication he has "mad
hacker skillz" in any way or form (and given the tone of that
Encyclopedia Dramatica page, I assume they'd be bragging about it if
he did).

-bawolff

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 6:01 PM, Anthony  wrote:
> There's also a
> description at http://en.wikipedia.org/wiki/User:Grawp , which does
> not do justice to the "mad hacker skillz" of this individual and his
> intent on finding bugs in mediawiki and exploiting them.

(and/or the Grawp copycats - personally I don't know if it was "Grawp"
himself or a copycat that attacked my wiki)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 5:50 PM, Chad  wrote:
> On Sun, Sep 18, 2011 at 5:47 PM, Anthony  wrote:
>> On Sun, Sep 18, 2011 at 5:30 PM, Chad  wrote:
>>> On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
>>>  wrote:
 It is meaningless to talk about cryptography without a threat model, just 
 as Robert says. Is anybody actually attacking us? Or are we worried about 
 accidental collisions?

>>>
>>> I believe it began as accidental collisions, then everyone promptly
>>> put on their tinfoil hats and started talking about a hypothetical
>>> vandal who has the time and desire to generate hash collisions.
>>
>> Having run a wiki which I eventually abandoned due to various "Grawp
>> attacks", I can assure you that there's nothing hypothetical about it.
>>
>
> For those of us who do not know...what the heck is a Grawp attack?
> Does it involve generating hash collisions?

It does not involve generating hash collisions, but it involves
finding various bugs in mediawiki and using them to vandalise, often
by injecting javascript.  The best description I could find was at
Encyclopedia Dramatica, which seems to be taken down (there's a cache
if you do a google search for "grawp wikipedia").  There's also a
description at http://en.wikipedia.org/wiki/User:Grawp , which does
not do justice to the "mad hacker skillz" of this individual and his
intent on finding bugs in mediawiki and exploiting them.

If you did something as lame as relying on no one generating an MD5
collision (*), it would happen.  If you use SHA-1, it may or may not
happen, depending on how quickly computers get faster, and how many
further attacks are made on the algorithm.  If you use SHA-256 (**),
it's significantly less likely to happen, and you'll probably have a
warning in the form of an announcement on Slashdot that SHA-256 has
been broken, before it happens.

(*) Something which I have done myself on my home computer in a couple
minutes, and apparently now can be done in a couple seconds.

(**) Which, incidentally, is possibly the single most secure hash for
Wikimedia to use at the current time.  SHA-512 is significantly more
"broken" than SHA-256, and the more theoretically secure hashes have
received much less scrutiny than SHA-256.  If you want to be more
secure than SHA-256, you should combine SHA-256 with some other
hashing algorithm.)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Platonides
Chad wrote:
> For those of us who do not know...what the heck is a Grawp attack?
> Does it involve generating hash collisions?
>
> -Chad

It's the name of a wikipedia vandal.
http://en.wikipedia.org/wiki/User:Grawp



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Chad
On Sun, Sep 18, 2011 at 5:47 PM, Anthony  wrote:
> On Sun, Sep 18, 2011 at 5:30 PM, Chad  wrote:
>> On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
>>  wrote:
>>> It is meaningless to talk about cryptography without a threat model, just 
>>> as Robert says. Is anybody actually attacking us? Or are we worried about 
>>> accidental collisions?
>>>
>>
>> I believe it began as accidental collisions, then everyone promptly
>> put on their tinfoil hats and started talking about a hypothetical
>> vandal who has the time and desire to generate hash collisions.
>
> Having run a wiki which I eventually abandoned due to various "Grawp
> attacks", I can assure you that there's nothing hypothetical about it.
>

For those of us who do not know...what the heck is a Grawp attack?
Does it involve generating hash collisions?

-Chad

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 5:30 PM, Chad  wrote:
> On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
>  wrote:
>> It is meaningless to talk about cryptography without a threat model, just as 
>> Robert says. Is anybody actually attacking us? Or are we worried about 
>> accidental collisions?
>>
>
> I believe it began as accidental collisions, then everyone promptly
> put on their tinfoil hats and started talking about a hypothetical
> vandal who has the time and desire to generate hash collisions.

Having run a wiki which I eventually abandoned due to various "Grawp
attacks", I can assure you that there's nothing hypothetical about it.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Chad
On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
 wrote:
> It is meaningless to talk about cryptography without a threat model, just as 
> Robert says. Is anybody actually attacking us? Or are we worried about 
> accidental collisions?
>

I believe it began as accidental collisions, then everyone promptly
put on their tinfoil hats and started talking about a hypothetical
vandal who has the time and desire to generate hash collisions.

-Chad

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 2:33 AM, Ariel T. Glenn  wrote:
> Στις 17-09-2011, ημέρα Σαβ, και ώρα 22:55 -0700, ο/η Robert Rohde
> έγραψε:
>> On Sat, Sep 17, 2011 at 4:56 PM, Anthony  wrote:
>
> 
>
>> > For offline analyses, there's no need to change the online database tables.
>>
>> Need?  That's debatable, but one of the major motivators is the desire
>> to have hash values in database dumps (both for revert checks and for
>> checksums on correct data import / export).  Both of those are
>> "offline" uses, but it is beneficial to have that information
>> precomputed and stored rather than frequently regenerated.
>
> If we don't have it in the online database tables, this defeats the
> purpose of having the value in there at all, for the purpose of
> generating the XML dumps.
>
> Recall that the dumps are generated in two passes; in the first pass we
> retrieve from the db and record all of the metadata about revisions, and
> in the second (time-comsuming) pass we re-use the text of the revisions
> from a previous dump file if the text is in there.  We want to compare
> the has of that text against what the online database says the hash is;
> if they don't match, we want to fetch the live copy.

Well, this is exactly the type of use in which collisions do matter.
Do you really want the dump to not record the correct data when some
miscreant creates an intentional collision?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
 wrote:
> It is meaningless to talk about cryptography without a threat model, just as 
> Robert says. Is
> anybody actually attacking us?

You mean, like Grawp?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Russell N. Nelson - rnnelson
What is the threat?

Sent from my Verizon Wireless Phone


-Original message-
From: Ilmari Karonen 
To: Wikimedia developers 
Sent: Sun, Sep 18, 2011 20:20:34 GMT+00:00
Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table 
(discussing r94289)

On 09/18/2011 08:55 AM, Robert Rohde wrote:
> people find ways to improve the attacks on SHA-1.  (The existing
> attacks usually require the ability to feed arbitrary binary strings
> into the hash function.  Given that both browsers and Mediawiki will
> tend to reject binary data placed in an edit window, I'm not sure if
> any of the existing attacks could be reliably applied to Mediawiki
> editing.)

I'm pretty sure MediaWiki will accept any data that's valid UTF-8,
modulo canonicalization perhaps.  I'm not very familiar with the MD5 and
SHA-1 collision attacks, but I wouldn't be surprised if at least some of
them could be modified to use, say, only 7-bit ASCII.


> If collision attacks really matter we should use SHA-1.  However, do
> any of the proposed use cases care about whether someone might
> intentionally inject a collision?  In the proposed uses I've looked at
> it, it seems irrelevant.  The intentional collision will get flagged
> as a revert and the text leading to that collision would be discarded.
>   How is that a bad thing?

Well, if you could predict the content of a version that someone (say, a
bot) was likely to save sometime in the future, and created a different
revision with the same hash (say, in the sandbox or in your userspace,
so that people wouldn't notice it) in advance...

Depending on just what page was targeted, the consequences could range
from a minor annoyance to site-wide JS injection.

Anyway, I wouldn't suggest using either MD5 or SHA-1: both have known
attacks, and it's a fundamental rule of cryptography that attacks always
get better over time, never worse.  Let's _at least_ use SHA-2.

(Actually, I'd suggest designing the format so that we can change hash
functions in the future without having to rehash every old revision
immediately.  For example, we might store a hash computed using SHA-256
as "sha256:d9014c4624844aa..." or something like that.)

--
Ilmari Karonen

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Ilmari Karonen
On 09/18/2011 08:55 AM, Robert Rohde wrote:
> people find ways to improve the attacks on SHA-1.  (The existing
> attacks usually require the ability to feed arbitrary binary strings
> into the hash function.  Given that both browsers and Mediawiki will
> tend to reject binary data placed in an edit window, I'm not sure if
> any of the existing attacks could be reliably applied to Mediawiki
> editing.)

I'm pretty sure MediaWiki will accept any data that's valid UTF-8, 
modulo canonicalization perhaps.  I'm not very familiar with the MD5 and 
SHA-1 collision attacks, but I wouldn't be surprised if at least some of 
them could be modified to use, say, only 7-bit ASCII.


> If collision attacks really matter we should use SHA-1.  However, do
> any of the proposed use cases care about whether someone might
> intentionally inject a collision?  In the proposed uses I've looked at
> it, it seems irrelevant.  The intentional collision will get flagged
> as a revert and the text leading to that collision would be discarded.
>   How is that a bad thing?

Well, if you could predict the content of a version that someone (say, a 
bot) was likely to save sometime in the future, and created a different 
revision with the same hash (say, in the sandbox or in your userspace, 
so that people wouldn't notice it) in advance...

Depending on just what page was targeted, the consequences could range 
from a minor annoyance to site-wide JS injection.

Anyway, I wouldn't suggest using either MD5 or SHA-1: both have known 
attacks, and it's a fundamental rule of cryptography that attacks always 
get better over time, never worse.  Let's _at least_ use SHA-2.

(Actually, I'd suggest designing the format so that we can change hash 
functions in the future without having to rehash every old revision 
immediately.  For example, we might store a hash computed using SHA-256 
as "sha256:d9014c4624844aa..." or something like that.)

-- 
Ilmari Karonen

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Russell N. Nelson - rnnelson
It is meaningless to talk about cryptography without a threat model, just as 
Robert says. Is anybody actually attacking us? Or are we worried about 
accidental collisions?

Sent from my Verizon Wireless Phone


-Original message-
From: Robert Rohde 
To: Wikimedia developers 
Sent: Sun, Sep 18, 2011 05:56:15 GMT+00:00
Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table 
(discussing r94289)

On Sat, Sep 17, 2011 at 4:56 PM, Anthony  wrote:
> On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde  wrote:
>> Is there a good reason to prefer SHA-1?
>>
>> Both have weaknesses allowing one to construct a collision (with
>> considerable effort)
>
> Considerable effort?  I can create an MD5 collision in a few minutes
> on my home computer.  Is there anything even remotely like this for
> SHA-1?

If I've been keeping up to date, the collision complexity for MD5 is
about 2^21 operations, and runs in a few seconds (not minutes); and
for SHA-1 down to about 2^52 with current results.  The latter
represents about 100 cpu-years, which is within the realm of
supercomputers.  That time will probably continue to come down if
people find ways to improve the attacks on SHA-1.  (The existing
attacks usually require the ability to feed arbitrary binary strings
into the hash function.  Given that both browsers and Mediawiki will
tend to reject binary data placed in an edit window, I'm not sure if
any of the existing attacks could be reliably applied to Mediawiki
editing.)

If collision attacks really matter we should use SHA-1.  However, do
any of the proposed use cases care about whether someone might
intentionally inject a collision?  In the proposed uses I've looked at
it, it seems irrelevant.  The intentional collision will get flagged
as a revert and the text leading to that collision would be discarded.
 How is that a bad thing?

It's a not a big deal, but if I understand prior comments correctly,
most of the existing offline infrastructure uses MD5, so I'm wondering
if there is a distinct use case for favoring SHA-1.

>> MD5 is shorter and in my experience about 25% faster to compute.
>>
>> Personally I've tended to view MD5 as more than good enough in offline 
>> analyses.
>
> For offline analyses, there's no need to change the online database tables.

Need?  That's debatable, but one of the major motivators is the desire
to have hash values in database dumps (both for revert checks and for
checksums on correct data import / export).  Both of those are
"offline" uses, but it is beneficial to have that information
precomputed and stored rather than frequently regenerated.

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-17 Thread Ariel T. Glenn
Στις 17-09-2011, ημέρα Σαβ, και ώρα 22:55 -0700, ο/η Robert Rohde
έγραψε:
> On Sat, Sep 17, 2011 at 4:56 PM, Anthony  wrote:



> > For offline analyses, there's no need to change the online database tables.
> 
> Need?  That's debatable, but one of the major motivators is the desire
> to have hash values in database dumps (both for revert checks and for
> checksums on correct data import / export).  Both of those are
> "offline" uses, but it is beneficial to have that information
> precomputed and stored rather than frequently regenerated.

If we don't have it in the online database tables, this defeats the
purpose of having the value in there at all, for the purpose of
generating the XML dumps.  

Recall that the dumps are generated in two passes; in the first pass we
retrieve from the db and record all of the metadata about revisions, and
in the second (time-comsuming) pass we re-use the text of the revisions
from a previous dump file if the text is in there.  We want to compare
the has of that text against what the online database says the hash is;
if they don't match, we want to fetch the live copy.  

I refer folks to bug 23264 [1] as proof that mismatch between the
metadata and the text has crept in in the past; changes to MW code in
other places than the backups scripts can cause quite subtle breakage.

Ariel

[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=23264



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-17 Thread Robert Rohde
On Sat, Sep 17, 2011 at 4:56 PM, Anthony  wrote:
> On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde  wrote:
>> Is there a good reason to prefer SHA-1?
>>
>> Both have weaknesses allowing one to construct a collision (with
>> considerable effort)
>
> Considerable effort?  I can create an MD5 collision in a few minutes
> on my home computer.  Is there anything even remotely like this for
> SHA-1?

If I've been keeping up to date, the collision complexity for MD5 is
about 2^21 operations, and runs in a few seconds (not minutes); and
for SHA-1 down to about 2^52 with current results.  The latter
represents about 100 cpu-years, which is within the realm of
supercomputers.  That time will probably continue to come down if
people find ways to improve the attacks on SHA-1.  (The existing
attacks usually require the ability to feed arbitrary binary strings
into the hash function.  Given that both browsers and Mediawiki will
tend to reject binary data placed in an edit window, I'm not sure if
any of the existing attacks could be reliably applied to Mediawiki
editing.)

If collision attacks really matter we should use SHA-1.  However, do
any of the proposed use cases care about whether someone might
intentionally inject a collision?  In the proposed uses I've looked at
it, it seems irrelevant.  The intentional collision will get flagged
as a revert and the text leading to that collision would be discarded.
 How is that a bad thing?

It's a not a big deal, but if I understand prior comments correctly,
most of the existing offline infrastructure uses MD5, so I'm wondering
if there is a distinct use case for favoring SHA-1.

>> MD5 is shorter and in my experience about 25% faster to compute.
>>
>> Personally I've tended to view MD5 as more than good enough in offline 
>> analyses.
>
> For offline analyses, there's no need to change the online database tables.

Need?  That's debatable, but one of the major motivators is the desire
to have hash values in database dumps (both for revert checks and for
checksums on correct data import / export).  Both of those are
"offline" uses, but it is beneficial to have that information
precomputed and stored rather than frequently regenerated.

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-17 Thread Anthony
On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde  wrote:
> Is there a good reason to prefer SHA-1?
>
> Both have weaknesses allowing one to construct a collision (with
> considerable effort)

Considerable effort?  I can create an MD5 collision in a few minutes
on my home computer.  Is there anything even remotely like this for
SHA-1?

> MD5 is shorter and in my experience about 25% faster to compute.
>
> Personally I've tended to view MD5 as more than good enough in offline 
> analyses.

For offline analyses, there's no need to change the online database tables.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-17 Thread Robert Rohde
On Sat, Sep 17, 2011 at 8:26 AM, Roan Kattouw  wrote:
> Minor detail: I think it's more likely we'll use SHA-1 hashes rather
> than MD5 hashes.

Is there a good reason to prefer SHA-1?

Both have weaknesses allowing one to construct a collision (with
considerable effort), but I wouldn't see why that would matter for the
proposed use.

With only about 1 billion revisions in the collective databases, the
odds of an accidental collision with either MD5 or SHA-1 is
infinitesimal (less than 1 in 10^18 for the weaker MD5).

MD5 is shorter and in my experience about 25% faster to compute.

Personally I've tended to view MD5 as more than good enough in offline analyses.

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-17 Thread Platonides

Roan Kattouw wrote:
> On Fri, Sep 16, 2011 at 6:48 PM, Thomas Gries  wrote:
>> Was there a certain reason to chose base 36 ?
>> Why not recoding to base 62 and saving 3 bytes per checksum ?
>>
> I don't know, this was way, way before my time. But then, why use base
> 62 if you can use base 64? Encoders/decoders for that are much more
> widely available. PHP's base_convert() will go up to 36, I think, so
> you'd have to write a base 62 encoder/decoder yourself, but PHP has
> built-in functions to deal with base 64.

Brion explained it on previous mail. And yes, we have our own converter 
in MediaWiki.




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-17 Thread Roan Kattouw
On Fri, Sep 16, 2011 at 6:48 PM, Thomas Gries  wrote:
> Was there a certain reason to chose base 36 ?
> Why not recoding to base 62 and saving 3 bytes per checksum ?
>
I don't know, this was way, way before my time. But then, why use base
62 if you can use base 64? Encoders/decoders for that are much more
widely available. PHP's base_convert() will go up to 36, I think, so
you'd have to write a base 62 encoder/decoder yourself, but PHP has
built-in functions to deal with base 64.

> base 16 = CHAR(32) md5 sum in hexadecimal
> base 36 = CHAR(25)
> base 62 = CHAR(22)
>
Minor detail: I think it's more likely we'll use SHA-1 hashes rather
than MD5 hashes.

Roan Kattouw (Catrope)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-16 Thread Brion Vibber
On Fri, Sep 16, 2011 at 9:48 AM, Thomas Gries  wrote:

> Am 16.09.2011 11:24, schrieb Roan Kattouw:
> > For some applications, I use the technique of representing the 128 bit
> > of md5 or other checksums
> >
> >as base-62 character strings
> >instead of hexadecimal (base-16) strings.
>
> > MediaWiki already uses a similar technique, storing SHA-1 hashes of
> > images in base 36.
> Was there a certain reason to chose base 36 ?
> Why not recoding to base 62 and saving 3 bytes per checksum ?
>

This format was chosen for hashes to be used as filenames for uploaded file
storage (currently used only for storing deleted files, I think, but there's
long been a long-term plan to switch primary image storage to this as well
some day).

For greatest compatibility with all filesystems, we only use characters that
are safe (ASCII digits and letters) and don't rely on case distinctions
which are not always preserved (Windows and Mac OS X systems default to
case-insensitive filesystems).

The reason we're not using hex here is that a more compact representation
makes the filenames, and thus any URL references including them in the path,
shorter. On img_sha1 I guess we just kept using it for
compatibility/similarity with the deleted file archives?

-- brion
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-16 Thread Daniel Friesen
On 11-09-16 09:48 AM, Thomas Gries wrote:
> Am 16.09.2011 11:24, schrieb Roan Kattouw:
>> For some applications, I use the technique of representing the 128 bit
>> of md5 or other checksums
>>
>>as base-62 character strings
>>instead of hexadecimal (base-16) strings.
>> MediaWiki already uses a similar technique, storing SHA-1 hashes of
>> images in base 36.
> Was there a certain reason to chose base 36 ?
> Why not recoding to base 62 and saving 3 bytes per checksum ?
>
> base 16 = CHAR(32) md5 sum in hexadecimal
> base 36 = CHAR(25)
> base 62 = CHAR(22)
>
> using base 62 (upper and lower ASCII letters and digits) would save 3
> bytes over base 36.
> if my calculator works correctly ;-)
If we're picking apart checksum sizes in database storage. Why not just
go all the way and store the binary data as binary with a BINARY(16)
16, 36, 62... what happened to base 64 anyways? php even has native code
for that.

~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-16 Thread Neil Kandalgaonkar
On 9/16/11 9:48 AM, Thomas Gries wrote:
> Am 16.09.2011 11:24, schrieb Roan Kattouw:
>> For some applications, I use the technique of representing the 128 bit
>> of md5 or other checksums
>>
>> as base-62 character strings
>> instead of hexadecimal (base-16) strings.
>
>> MediaWiki already uses a similar technique, storing SHA-1 hashes of
>> images in base 36.
> Was there a certain reason to chose base 36 ?
> Why not recoding to base 62 and saving 3 bytes per checksum ?

11M images in commons x 3 bytes = 33MB / ~1MB per file on average =
a savings equivalent to 33 more files in Commons

In the time it took you to write this we probably had more than 33 files 
uploaded.

-- 
Neil Kandalgaonkar (|  

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-16 Thread Thomas Gries
Am 16.09.2011 11:24, schrieb Roan Kattouw:
> For some applications, I use the technique of representing the 128 bit
> of md5 or other checksums
>
>as base-62 character strings
>instead of hexadecimal (base-16) strings.

> MediaWiki already uses a similar technique, storing SHA-1 hashes of
> images in base 36.
Was there a certain reason to chose base 36 ?
Why not recoding to base 62 and saving 3 bytes per checksum ?

base 16 = CHAR(32) md5 sum in hexadecimal
base 36 = CHAR(25)
base 62 = CHAR(22)

using base 62 (upper and lower ASCII letters and digits) would save 3
bytes over base 36.
if my calculator works correctly ;-)





signature.asc
Description: OpenPGP digital signature
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-16 Thread Roan Kattouw
On Fri, Sep 16, 2011 at 8:15 AM, Thomas Gries  wrote:
> For some applications, I use the technique of representing the 128 bit
> of md5 or other checksums
>
>        as base-62 character strings
>        instead of hexadecimal (base-16) strings.
>
MediaWiki already uses a similar technique, storing SHA-1 hashes of
images in base 36.

Roan

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-15 Thread Thomas Gries
RE:
http://www.mediawiki.org/wiki/Requests_for_comment/Database_field_for_checksum_of_page_text#Field_type

Recently, Adding MD5 / SHA1 column to revision table (discussing r94289)
was discussed.

For some applications, I use the technique of representing the 128 bit
of md5 or other checksums

as base-62 character strings
instead of hexadecimal (base-16) strings.

When you need a non-binary, i.e. character representation for
displaying, storing or transmitting checksums,
you should always consider to use base-62 instead of base-16
representations (for md5 you need 22 bytes instead of 32 bytes).

When you encode the 128 bits of MD5 (example) in base-16 aka
/hexadecimal/, you need CHAR(32).
When you use the technique of enoding the 128 bits in /base-62/ with
characters from [A-Za-z0-9], you'll need CHAR(22).

See http://de.php.net/manual/en/function.md5.php#83321 for one possible
implementation.

I found this for example first in Microsofts free FCIV program which
creates checksum xml files and uses this shorter representation.




signature.asc
Description: OpenPGP digital signature
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-04 Thread Diederik van Liere
Thanks for moving the page.
Diederik

On 2011-09-04, at 3:29 PM, Krinkle wrote:

> 2011/9/4 MZMcBride 
> 
>> Diederik van Liere wrote:
>>> I've suggested to generate bulk checksums as well but both Brion and
>> Ariel see
>>> the primary purpose of this field to check the validity of the dump
>> generating
>>> process and so they want to generate the checksums straight from the
>> external
>>> storage.
>>> 
>>> [...]
>>> 
>>> PS: not sure if this proposal should be on strategy or mediawiki...
>> 
>> I think standard practice nowadays is a subpage of
>> .
>> 
>> MZMcBride
>> 
>> 
> Indeed. Moved:
> http://mediawiki.org/wiki/Requests_for_comment/Database_field_for_checksum_of_page_text
> 
> 
> – Krinkle
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-04 Thread Aryeh Gregor
On Sat, Sep 3, 2011 at 12:33 AM, Rob Lanphier  wrote:
> I generally suspect that a standard index is going to be a waste for
> the most urgent uses of this.  It will rarely be interesting to search
> for common hashes between articles.  The far more common case will be
> to search for duplicate hashes within the history of a single article.
>  My understanding is that having a normal index on a table the size of
> our revision table will be far too expensive for db writes.

If it's useful for Wikimedia and the index is expensive to maintain,
the index can be kept on only one slave per DB cluster to minimize
cost, the way we've always done for RC.  If it's not useful for
Wikimedia, only third-party researchers, then we could leave it
unindexed on the cluster and have an index only on the toolserver or
whatnot.

It's not so much that an extra index would be untenably expensive in
principle, more that it's not worth the cost if it's not used for
anything important.  If it's just for statistical analysis and not for
the live site, there's no reason to have it on all servers, except
maybe administrative simplicity.  The toolserver already has a bunch
of indexes that Wikimedia doesn't, for exactly this sort of reason.

> This is the first I've heard of partial indexes (/me researches)  I
> don't know if a partial index is going to be cheap enough that we can
> use it, and useful enough that we'd want to.  Would this be a faster
> query in a world with a partial index on the first six characters?
> SELECT rev_id FROM revision WHERE rev_page=12345 AND
> rev_sha1='4cdbd80be15fcfff139fb8a95f2ca359520939ee'

I'd think so, yes.  MySQL should be smart enough to use partial
indexes at least that far -- otherwise there'd be no point in
supporting them.  I'd think an index on the first several bytes should
be almost as effective as one on the whole value, if you just want to
do filtering or joining (not sorting).  It might be more effective,
since more would fit in RAM.  However, I don't know for sure offhand.

> ...and would either of these queries be considered too expensive to
> run without a partial index?  How about with a partial index?

Without a suitable index, running either of these queries would scan
the entire revision history of the article in question.  That would
certainly not be acceptable on the cluster if the query is to be run
with any frequency on arbitrary articles.  It would be okay if it were
just an occasional thing run on a limited number of articles, or only
on articles with few revisions in their histories.

All of the above should be interpreted with the understanding that I
stand a decent chance of knowing what I'm talking about, but have no
say at this point and shouldn't be taken too seriously.  :)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-04 Thread Krinkle
2011/9/4 MZMcBride 

> Diederik van Liere wrote:
> > I've suggested to generate bulk checksums as well but both Brion and
> Ariel see
> > the primary purpose of this field to check the validity of the dump
> generating
> > process and so they want to generate the checksums straight from the
> external
> > storage.
> >
> > [...]
> >
> > PS: not sure if this proposal should be on strategy or mediawiki...
>
> I think standard practice nowadays is a subpage of
> .
>
> MZMcBride
>
>
Indeed. Moved:
http://mediawiki.org/wiki/Requests_for_comment/Database_field_for_checksum_of_page_text


– Krinkle
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-03 Thread MZMcBride
Diederik van Liere wrote:
> I've suggested to generate bulk checksums as well but both Brion and Ariel see
> the primary purpose of this field to check the validity of the dump generating
> process and so they want to generate the checksums straight from the external
> storage. 
> 
> [...]
> 
> PS: not sure if this proposal should be on strategy or mediawiki...

I think standard practice nowadays is a subpage of
.

MZMcBride



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-03 Thread Roan Kattouw
On Sat, Sep 3, 2011 at 2:20 AM, Asher Feldman  wrote:
> Is code written to populate rev_sha1 on each new edit?
>
I believe that was part of Aaron's code that got reverted, yes.

Offline generation of hashes is definitely possible, but the only
reason you'd do it is to minimize the time during which some rows have
an empty hash. It's not strictly offline in that the text will still
have to be pulled from ES, but it can be done before the field is
added.

Roan

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-03 Thread Diederik van Liere
Hi,

I've suggested to generate bulk checksums as well but both Brion and Ariel see 
the primary purpose of this field to check the validity of the dump generating 
process and so they want to generate the checksums straight from the external 
storage. 

In a general sense, there are two use cases for this new field:
1) Checking the validity of the XML dump files
2) Identifying reverts

I have started to work on a proposal for deployment (and while being 
incomplete) it might be a good start to further plan the deployment. I have 
been trying to come up with some back-of-the-envelope calculations about how 
much time and space it would take but I don't have all the required information 
yet to come up with some reasonable estimates. 

You can find the proposal here: 
http://strategy.wikimedia.org/wiki/Proposal:Implement_and_deploy_checksum_revision_table

I want to thank Brion and Asher for giving feedback on prior drafts. Please 
feel free to improve this proposal.

Best,
Diederik

PS: not sure if this proposal should be on strategy or mediawiki...


On 2011-09-03, at 7:16 AM, Daniel Friesen wrote:

> On 11-09-02 09:33 PM, Rob Lanphier wrote:
>> On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen
>>  wrote:
>>> On 11-09-02 05:20 PM, Asher Feldman wrote:
 When using for analysis, will we wish the new columns had partial indexes
 (first 6 characters?)
>>> Bug 2939 is one relevant bug to this, it could probably use an index.
>>> [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
>> My understanding is that having a normal index on a table the size of
>> our revision table will be far too expensive for db writes.
>> ...
>> Rob
> We've got 5 normal indexes on revision:
> - A unique int+int
> - A binary(14)
> - An int+binary(14)
> - Another int+binary(14)
> - And a varchar(255)+binary(14)
> 
> That bug wise a (rev_page,rev_sha1) or (rev_page,rev_timestamp,rev_sha1)
> may do.
> 
> -- 
> ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
> 
> 
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-02 Thread Daniel Friesen
On 11-09-02 09:33 PM, Rob Lanphier wrote:
> On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen
>  wrote:
>> On 11-09-02 05:20 PM, Asher Feldman wrote:
>>> When using for analysis, will we wish the new columns had partial indexes
>>> (first 6 characters?)
>> Bug 2939 is one relevant bug to this, it could probably use an index.
>> [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
>  My understanding is that having a normal index on a table the size of
> our revision table will be far too expensive for db writes.
> ...
> Rob
We've got 5 normal indexes on revision:
- A unique int+int
- A binary(14)
- An int+binary(14)
- Another int+binary(14)
- And a varchar(255)+binary(14)

That bug wise a (rev_page,rev_sha1) or (rev_page,rev_timestamp,rev_sha1)
may do.

-- 
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-02 Thread Rob Lanphier
On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen
 wrote:
> On 11-09-02 05:20 PM, Asher Feldman wrote:
>> When using for analysis, will we wish the new columns had partial indexes
>> (first 6 characters?)

> Bug 2939 is one relevant bug to this, it could probably use an index.
> [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939

I generally suspect that a standard index is going to be a waste for
the most urgent uses of this.  It will rarely be interesting to search
for common hashes between articles.  The far more common case will be
to search for duplicate hashes within the history of a single article.
 My understanding is that having a normal index on a table the size of
our revision table will be far too expensive for db writes.

This is the first I've heard of partial indexes (/me researches)  I
don't know if a partial index is going to be cheap enough that we can
use it, and useful enough that we'd want to.  Would this be a faster
query in a world with a partial index on the first six characters?
SELECT rev_id FROM revision WHERE rev_page=12345 AND
rev_sha1='4cdbd80be15fcfff139fb8a95f2ca359520939ee'

...or would we have to run a query like this to get the benefit of the index?
SELECT rev_id FROM revision WHERE rev_page=12345 AND rev_sha1 like '4cdbd8%'

...and would either of these queries be considered too expensive to
run without a partial index?  How about with a partial index?

Rob

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-02 Thread Daniel Friesen
Bug 2939 is one relevant bug to this, it could probably use an index.

[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939

~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]

On 11-09-02 05:20 PM, Asher Feldman wrote:
> Would it be possible to generate offline hashes for the bulk of our revision
> corpus via dumps and load that into prod to minimize the time and impact of
> the backfill?
>
> When using for analysis, will we wish the new columns had partial indexes
> (first 6 characters?)
>
> Is code written to populate rev_sha1 on each new edit?
>
> On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere 
> wrote:
>
>> Hi!
>> I am starting this thread because Brion's revision r94289 reverted
>> r94289 [0] stating "core schema change with no discussion" [1].
>> Bugs 21860 [2] and 25312 [3] advocate for the inclusion of a hash
>> column (either md5 or sha1) in the revision table. The primary use
>> case of this column will be to assist detecting reverts. I don't think
>> that data integrity is the primary reason for adding this column. The
>> huge advantage of having such a column is that it will not be longer
>> necessary to analyze full dumps to detect reverts, instead you can
>> look for reverts in the stub dump file by looking for the same hash
>> within a single page. The fact that there is a theoretical chance of a
>> collision is not very important IMHO, it would just mean that in very
>> rare cases in our research we would flag an edit being reverted  while
>> it's not. The two bug reports contain quite long discussions and this
>> feature has also been discussed internally quite extensively but oddly
>> enough it hasn't happened yet on the mailinglist.
>>
>> So let's have a discussion!
>>
>> [0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94289
>> [1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94541
>> [2] https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
>> [3] https://bugzilla.wikimedia.org/show_bug.cgi?id=25312
>>
>> Best,
>>
>> Diederik
>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>

-- 
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-02 Thread Asher Feldman
Would it be possible to generate offline hashes for the bulk of our revision
corpus via dumps and load that into prod to minimize the time and impact of
the backfill?

When using for analysis, will we wish the new columns had partial indexes
(first 6 characters?)

Is code written to populate rev_sha1 on each new edit?

On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere wrote:

> Hi!
> I am starting this thread because Brion's revision r94289 reverted
> r94289 [0] stating "core schema change with no discussion" [1].
> Bugs 21860 [2] and 25312 [3] advocate for the inclusion of a hash
> column (either md5 or sha1) in the revision table. The primary use
> case of this column will be to assist detecting reverts. I don't think
> that data integrity is the primary reason for adding this column. The
> huge advantage of having such a column is that it will not be longer
> necessary to analyze full dumps to detect reverts, instead you can
> look for reverts in the stub dump file by looking for the same hash
> within a single page. The fact that there is a theoretical chance of a
> collision is not very important IMHO, it would just mean that in very
> rare cases in our research we would flag an edit being reverted  while
> it's not. The two bug reports contain quite long discussions and this
> feature has also been discussed internally quite extensively but oddly
> enough it hasn't happened yet on the mailinglist.
>
> So let's have a discussion!
>
> [0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94289
> [1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94541
> [2] https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
> [3] https://bugzilla.wikimedia.org/show_bug.cgi?id=25312
>
> Best,
>
> Diederik
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-02 Thread Erik Moeller
On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere  wrote:
> Hi!
> I am starting this thread because Brion's revision r94289 reverted
> r94289 [0] stating "core schema change with no discussion" [1].

Bumping this: What are the remaining open questions regarding this
schema change? As Diederik explains, this has a number of potential
benefits especially for research/analysis, and I think it makes sense
to target it for 1.19 or 1.20 unless there are scary implications that
haven't been surfaced yet. :-)

-- 
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l