Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-12-07 Thread Daniel Kinzler
On 06.12.2017 at 22:09, John Erling Blad wrote:
> What is the current state? Will some kind of digest be retained?

The current plan is to keep the SHA1 hash, one for each slot, and an aggregate
one for the revision. If there is only one slot, the revision hash is the same
as the slot hash.
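Illustratively, that aggregation could look like the following minimal Python
sketch (the combination rule shown here is an assumption for exposition; the
exact scheme is up to the MCR schema work, not established here):

    import hashlib

    def slot_sha1(content: bytes) -> str:
        """SHA-1 of a single slot's content, as hex."""
        return hashlib.sha1(content).hexdigest()

    def revision_sha1(slot_hashes: dict) -> str:
        """Aggregate hash for a revision, given {slot_role: slot_sha1}.

        With a single slot the revision hash is simply that slot's hash;
        otherwise hash the slot hashes in a canonical (sorted-role) order.
        """
        if len(slot_hashes) == 1:
            return next(iter(slot_hashes.values()))
        digest = hashlib.sha1()
        for role in sorted(slot_hashes):
            digest.update(slot_hashes[role].encode("ascii"))
        return digest.hexdigest()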

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-12-06 Thread John Erling Blad
What is the current state? Will some kind of digest be retained?

On Thu, Sep 21, 2017 at 9:56 PM, Gergo Tisza wrote:

> On Thu, Sep 21, 2017 at 6:10 AM, Daniel Kinzler <daniel.kinz...@wikimedia.de> wrote:
>
> > Yes, we could put it into a separate table. But that table would be exactly
> > as tall as the content table, and would be keyed to it. I see no advantage.
>
>
> The advantage is that MediaWiki would almost never need to use the hash
> table. It would need to add the hash for a new revision there, but table
> size is not much of an issue on INSERT; other than that, only slow
> operations like export and API requests which explicitly ask for the hash
> would need to join on that table.
> Or is this primarily a disk space concern?
>
> > > Also, since content is supposed to be deduplicated (so two revisions with
> > > the exact same content will have the same content_address), cannot that
> > > replace content_sha1 for revert detection purposes?
> >
> > Only if we could detect and track "manual" reverts. And the only reliable
> > way to
> > do this right now is by looking at the sha1.
>
>
> The content table points to a blob store which is content-addressable and
> has its own deduplication mechanism, right? So you just send it the content
> to store, and get an address back, and in the case of a manual revert, that
> address will be one that has already been used in other content rows. Or do
> you need to detect the revert before saving it?
>
> > SHA1 is not that slow.
>
> For the API/Special:Export definitely not. Maybe for generating the
> official dump files it might be significant? A single sha1 operation on a
> modern CPU should not take more than a microsecond: there are a few hundred
> operations in a decently implemented sha1 and processors are in the GHz
> range. PHP benchmarks [1] also give similar values. With the 64-byte block
> size, that's something like 5 hours/TB - not sure how that compares to the
> dump process itself (also it's probably running on lots of cores in
> parallel).
>
>
> [1] http://www.spudsdesign.com/benchmark/index.php?t=hash1

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-21 Thread Gergo Tisza
On Thu, Sep 21, 2017 at 6:10 AM, Daniel Kinzler wrote:

> Yes, we could put it into a separate table. But that table would be exactly
> as tall as the content table, and would be keyed to it. I see no advantage.


The advantage is that MediaWiki would almost never need to use the hash
table. It would need to add the hash for a new revision there, but table
size is not much of an issue on INSERT; other than that, only slow
operations like export and API requests which explicitly ask for the hash
would need to join on that table.
Or is this primarily a disk space concern?

> > Also, since content is supposed to be deduplicated (so two revisions with
> > the exact same content will have the same content_address), cannot that
> > replace content_sha1 for revert detection purposes?
>
> Only if we could detect and track "manual" reverts. And the only reliable
> way to
> do this right now is by looking at the sha1.


The content table points to a blob store which is content-addressable and
has its own deduplication mechanism, right? So you just send it the content
to store, and get an address back, and in the case of a manual revert, that
address will be one that has already been used in other content rows. Or do
you need to detect the revert before saving it?
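To make the idea concrete, here is a toy sketch (a hypothetical interface, not
MediaWiki's actual blob store API): in a content-addressed store, saving a
manual revert returns an address that has already been seen, and that repeat
address is itself the revert signal.

    import hashlib

    class ContentAddressedStore:
        """Toy content-addressed blob store: the address is derived from
        the content, so identical content deduplicates automatically."""

        def __init__(self):
            self.blobs = {}

        def store(self, content: bytes) -> str:
            address = "sha1:" + hashlib.sha1(content).hexdigest()
            self.blobs[address] = content
            return address

    store = ContentAddressedStore()
    seen_addresses = set()

    def save_revision(content: bytes) -> bool:
        """True if this save restored previously stored content,
        i.e. it looks like a (manual) revert."""
        address = store.store(content)
        is_revert = address in seen_addresses
        seen_addresses.add(address)
        return is_revert

    assert save_revision(b"first version") is False
    assert save_revision(b"vandalism") is False
    assert save_revision(b"first version") is True  # manual revert detected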

> SHA1 is not that slow.

For the API/Special:Export definitely not. Maybe for generating the
official dump files it might be significant? A single sha1 operation on a
modern CPU should not take more than a microsecond: there are a few hundred
operations in a decently implemented sha1 and processors are in the GHz
range. PHP benchmarks [1] also give similar values. With the 64-byte block
size, that's something like 5 hours/TB - not sure how that compares to the
dump process itself (also it's probably running on lots of cores in
parallel).


[1] http://www.spudsdesign.com/benchmark/index.php?t=hash1
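Spelled out, the back-of-envelope math above (assuming roughly one microsecond
per 64-byte SHA-1 block, a deliberately pessimistic single-core figure):

    terabyte = 10**12             # bytes of content to hash
    block_size = 64               # SHA-1 input block size in bytes
    seconds_per_block = 1e-6      # pessimistic single-core estimate

    blocks = terabyte / block_size                 # ~1.56e10 blocks
    total_seconds = blocks * seconds_per_block     # ~15,600 seconds
    print(f"{total_seconds / 3600:.1f} hours/TB")  # ~4.3, i.e. "something like 5"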

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-21 Thread Daniel Kinzler
On 21.09.2017 at 17:18, Federico Leva wrote:
> (Offlist)
> 
> Daniel Kinzler, 21/09/2017 17:24:
>> Hashing is a lot faster than loading the content. Since Special:Export needs
>> to load the content anyway, the extra cost of hashing is negligible.
> 
> I trust you, but really? Even when exporting 5000 revisions?

Exporting 5000 revisions is likely to time out due to the time it takes to even
load all the data. If we can load the data, we can probably also hash it in
time. SHA1 is not that slow. Hashing all 1269 PHP files in the includes
directory takes half a second of CPU time on my system (about 2 seconds wall
clock time).
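That measurement is easy to reproduce; a rough Python equivalent (any directory
tree of files will do, "includes" here stands in for MediaWiki core's PHP
directory):

    import hashlib
    import pathlib
    import time

    start = time.process_time()
    files = list(pathlib.Path("includes").rglob("*.php"))
    for path in files:
        hashlib.sha1(path.read_bytes()).hexdigest()
    cpu_seconds = time.process_time() - start
    print(f"hashed {len(files)} files in {cpu_seconds:.2f}s of CPU time")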

Hashing does put considerable load on the CPU though (on an otherwise I/O bound
operation), so it may cause problems if a lot of people do it. But since we have
a lot more edits than exports, and every edit needs hashing, I don't think that
makes much of a difference either.

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-21 Thread Daniel Kinzler
On 21.09.2017 at 11:24, Federico Leva (Nemo) wrote:
> The revision hashes are also supposed to be used by at least some of the 
> import
> tools for XML dumps. The dumps would be less valuable without some way to 
> check
> their content.

While this is a typical use case for hashes in theory, I have never heard of
any MediaWiki-related tool actually doing this.

> Generating hashes on the fly is surely not an option given
> exports can also need to happen within the time of a PHP request 
> (Special:Export
> for instance).

Hashing is a lot faster than loading the content. Since Special:Export needs to
load the content anyway, the extra cost of hashing is negligible.


If we only need the hashes in contexts where we also need the full content,
generating on the fly should work fine.

But if we need revision hashes in a list of 500 revisions returned from the API,
*that* we can't calculate on the fly. Similarly, database queries that need the
hashes to detect revisions with the same content can't use on-the-fly hashes.

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-21 Thread Daniel Kinzler
On 19.09.2017 at 20:48, Gergo Tisza wrote:
> Can't you just split it into a separate table? Core would only need to
> touch it on insert/update, so that should resolve the performance concerns.

Yes, we could put it into a separate table. But that table would be exactly as
tall as the content table, and would be keyed to it. I see no advantage. But if
DBAs prefer a separate table with a 1:1 relation to the content table, that's
fine with me.

Note that the content table is indeed touched a lot less than the revision 
table.

> Also, since content is supposed to be deduplicated (so two revisions with
> the exact same content will have the same content_address), cannot that
> replace content_sha1 for revert detection purposes?

Only if we could detect and track "manual" reverts. And the only reliable way to
do this right now is by looking at the sha1.

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-21 Thread Federico Leva (Nemo)
The revision hashes are also supposed to be used by at least some of the 
import tools for XML dumps. The dumps would be less valuable without 
some way to check their content. Generating hashes on the fly is surely 
not an option given exports can also need to happen within the time of a 
PHP request (Special:Export for instance).


Nemo


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-19 Thread Gergo Tisza
On Tue, Sep 19, 2017 at 6:42 AM, Daniel Kinzler wrote:

> That table will be tall, and the sha1 is the (on average) largest field. If
> we are going to use a different mechanism for tracking reverts soon, my hope
> was that we can do without it.
>

Can't you just split it into a separate table? Core would only need to
touch it on insert/update, so that should resolve the performance concerns.

Also, since content is supposed to be deduplicated (so two revisions with
the exact same content will have the same content_address), cannot that
replace content_sha1 for revert detection purposes? That wouldn't work over
large periods of time (when the original revision and the revert live in
different kinds of stores) but maybe that's an acceptable compromise.

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-19 Thread John Erling Blad
There are two important use cases: one where you want to identify previous
reverts, and one where you want to identify close matches. There are other
ways to do the first than using a digest, but the digest opens the door to
alternative client-side algorithms. The latter would typically be done with
some form of locality-sensitive hashing. In both cases you don't want to
download the content of each revision; that is exactly why you want some kind
of hash. If the hashes could be requested somehow, perhaps as part of the
API, then that should be sufficient. Those hashes could be part of the XML
dump too, but if you have the XML dump and know the algorithm, then you
don't need the digest.
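For the close-match case, a minimal locality-sensitive sketch in the spirit of
simhash (illustrative only; a real system would tune tokenization and the
distance threshold):

    import hashlib

    def simhash(text: str, bits: int = 64) -> int:
        """Near-identical texts get fingerprints with small Hamming distance."""
        weights = [0] * bits
        for token in text.split():
            h = int.from_bytes(hashlib.sha1(token.encode()).digest()[:8], "big")
            for i in range(bits):
                weights[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if weights[i] > 0)

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    rev1 = simhash("the quick brown fox jumps over the lazy dog")
    rev2 = simhash("the quick brown fox jumped over the lazy dog")
    print(hamming(rev1, rev2))  # small distance suggests a close match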

There is a specific use case where someone wants to verify the content. In
that case you don't want to identify a previous revert, you want to check
whether someone has tampered with the downloaded content. As you don't know
who might have tampered with the content, you should also question the
digest delivered by WMF; thus the digest in the database isn't good enough
as it is right now. Instead of a SHA digest, each revision should be
properly signed, but then, if you can't trust WMF, can you trust their
signature? Signatures for revisions should probably be delivered by some
external entity and not WMF itself.

On Fri, Sep 15, 2017 at 11:44 PM, Daniel Kinzler <daniel.kinz...@wikimedia.de> wrote:

> A revert restores a previous revision. It covers all slots.
>
> The fact that reverts, watching, protecting, etc still works per page,
> while you
> can have multiple kinds of different content on the page, is indeed the
> point of
> MCR.
>
> On 15.09.2017 at 22:23, C. Scott Ananian wrote:
> > Alternatively, perhaps "hash" could be an optional part of an MCR chunk?
> > We could keep it for the wikitext, but drop the hash for the metadata,
> and
> > drop any support for a "combined" hash over wikitext + all-other-pieces.
> >
> > ...which raises the question of how reverts work in MCR.  Is it just the
> > wikitext which is reverted, or do categories and other metadata revert as
> > well?  And perhaps we can just mark these at revert time instead of
> trying
> > to reconstruct it after the fact?
> >  --scott
> >
> > On Fri, Sep 15, 2017 at 4:13 PM, Stas Malyshev wrote:
> >
> >> Hi!
> >>
> >> On 9/15/17 1:06 PM, Andrew Otto wrote:
> >>> As a random idea - would it be possible to calculate the hashes
> >>> when data is transitioned from SQL to Hadoop storage?
> >>>
> >>> We take monthly snapshots of the entire history, so every month we’d
> >>> have to pull the content of every revision ever made :o
> >>
> >> Why? If you have already seen that revision in a previous snapshot, you'd
> >> already have its hash? Admittedly, I have no idea how the process works,
> >> so I am just talking out of general knowledge and may miss some things.
> >> Also, of course, you already have hashes for all revs up to the day we
> >> decide to turn the hash off. Starting that day, hashes would have to be
> >> generated, but I see no reason to generate one more than once?
> >> --
> >> Stas Malyshev
> >> smalys...@wikimedia.org
> >>
> >
> >
> >
>
>
> --
> Daniel Kinzler
> Principal Platform Engineer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-19 Thread Daniel Kinzler
On 19.09.2017 at 10:15, Jaime Crespo wrote:
> I am not a mediawiki developer, but shouldn't sha1 be moved instead of
> deleted/not deleted? Moved to the content table- so it is kept
> unaltered.
The background of my original mail is indeed the question whether we need the
sha1 field in the content table. The current draft of the DB schema includes it.

That table will be tall, and the sha1 is the (on average) largest field. If we
are going to use a different mechanism for tracking reverts soon, my hope was
that we can do without it.

In any case, my impression is that if we want to keep using hashes to detect
reverts, we need to keep rev_sha1 - and to maintain it, we ALSO need
content_sha1.

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-19 Thread Jaime Crespo
I am not a mediawiki developer, but shouldn't sha1 be moved, rather than
deleted or kept in place? Moved to the content table, so it is kept
unaltered.

That way it can be used for all the goals that have been discussed
(detecting reversions, XML dumps, etc.) and the values are not altered, just
moved away (which is more compatible). And it is not like structural
compatibility is going to be kept anyway, as many fields are going to be
"moved" there, so code using the tables directly has to change regardless;
but if the actual content is not altered, the sha field can be kept
unaltered with the same value as before. It would also allow detecting a
"partial reversion", meaning the wikitext is set to the same as a previous
one, which is what I assume it is used for now. However, now there will be
other content that can be reverted individually.

I do not know what exactly MCR is going to be used for, but if (silly
idea) the main article text and the categories are 2 different contents of
an article, and user A edits both while user B reverts the text only, that
would produce a different revision sha1 value; however, most use cases here
would want to detect the reversion by checking the sha of the text only
(aka the content). Equally, for backwards compatibility, storing it on
content would mean not having to recalculate it for all already existing
values, literally reducing this to a "trivial" code change, while keeping
all old data valid. Keeping the field as is, on revision, will mean all
historical data and old dumps become invalid. Full revision reverts, if
needed, can be checked by comparing each individual content sha or the
linked content ids.

If, on the other hand, revision should be kept completely backwards
compatible, some helper views can be created on the cloud wikireplicas;
but other than that, MCR would not be possible.

If at a later time, text with the same hash is detected (and content
double checked), content could be normalized by assigning the same id
to the same content?

On Mon, Sep 18, 2017 at 8:25 PM, Danny B. wrote:
>
> -- Original e-mail --
> From: Dan Andreescu
> To: Wikimedia developers
> Date: 18. 9. 2017 16:26:18
> Subject: Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
> "So, as things stand, rev_sha1 in the database is used for:
>
> 1. the XML dumps process and all the researchers depending on the XML dumps
> (probably just for revert detection)
> 2. revert detection for libraries like python-mwreverts [1]
> 3. revert detection in mediawiki history reconstruction processes in Hadoop
> (Wikistats 2.0)
> 4. revert detection in Wikistats 1.0
> 5. revert detection for tools that run on labs, like Wikimetrics
> ?. I think Aaron also uses rev_sha1 in ORES, but I can't seem to find the
> latest code for that service
>
> If you think about this list above as a flow of data, you'll see that
> rev_sha1 is replicated to xml, labs databases, hadoop, ML models, etc. So
> removing it and adding it back downstream from the main mediawiki database
> somewhere, like in XML, cuts off the other places that need it. That means
> it must be available either in the mediawiki database or in some other
> central database which all those other consumers can pull from.
> "
>
>
>
> I use rev_sha1 on the replicas to check the consistency of modules, templates,
> or other pages (typically help pages) which should be the same between projects
> (either within one language or even cross-language, if the page is not
> language-dependent). In other words, to detect possible changes in them and
> sync them.
>
> Also, I haven't noticed it mentioned in the thread: Flow also notifies users
> on reverts, but IDK whether it uses rev_sha1 or not. So I'm mentioning it
> just in case.
>
> Kind regards
>
> Danny B.



-- 
Jaime Crespo
<http://wikimedia.org>


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-18 Thread Danny B.

-- Original e-mail --
From: Dan Andreescu
To: Wikimedia developers
Date: 18. 9. 2017 16:26:18
Subject: Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
"So, as things stand, rev_sha1 in the database is used for:

1. the XML dumps process and all the researchers depending on the XML dumps
(probably just for revert detection)
2. revert detection for libraries like python-mwreverts [1]
3. revert detection in mediawiki history reconstruction processes in Hadoop
(Wikistats 2.0)
4. revert detection in Wikistats 1.0
5. revert detection for tools that run on labs, like Wikimetrics
?. I think Aaron also uses rev_sha1 in ORES, but I can't seem to find the
latest code for that service

If you think about this list above as a flow of data, you'll see that
rev_sha1 is replicated to xml, labs databases, hadoop, ML models, etc. So
removing it and adding it back downstream from the main mediawiki database
somewhere, like in XML, cuts off the other places that need it. That means
it must be available either in the mediawiki database or in some other
central database which all those other consumers can pull from.
"



I use rev_sha1 on the replicas to check the consistency of modules, templates,
or other pages (typically help pages) which should be the same between projects
(either within one language or even cross-language, if the page is not
language-dependent). In other words, to detect possible changes in them and
sync them.

Also, I haven't noticed it mentioned in the thread: Flow also notifies users
on reverts, but IDK whether it uses rev_sha1 or not. So I'm mentioning it
just in case.

Kind regards

Danny B.



Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-18 Thread Dan Andreescu
So, as things stand, rev_sha1 in the database is used for:

1. the XML dumps process and all the researchers depending on the XML dumps
(probably just for revert detection)
2. revert detection for libraries like python-mwreverts [1]
3. revert detection in mediawiki history reconstruction processes in Hadoop
(Wikistats 2.0)
4. revert detection in Wikistats 1.0
5. revert detection for tools that run on labs, like Wikimetrics
?. I think Aaron also uses rev_sha1 in ORES, but I can't seem to find the
latest code for that service

If you think about this list above as a flow of data, you'll see that
rev_sha1 is replicated to xml, labs databases, hadoop, ML models, etc.  So
removing it and adding it back downstream from the main mediawiki database
somewhere, like in XML, cuts off the other places that need it.  That means
it must be available either in the mediawiki database or in some other
central database which all those other consumers can pull from.

I defer to your expertise when you say it's expensive to keep in the db,
and I can see how that would get much worse with MCR.  I'm sure we can
figure something out, though.  Right now it seems like our options are, as
others have pointed out:

* compute async and store in DB or somewhere else that's central and easy
to access from all the branches I mentioned
* update how we detect reverts and keep a revert database with good
references to wiki_db, rev_id so it can be brought back in context.

Personally, I would love to get better revert detection, using sha1 exact
matches doesn't really get to the heart of the issue.  Important phenomena
like revert wars, bullying, and stalking are hiding behind bad revert
detection.  I'm happy to brainstorm ways we can use Analytics
infrastructure to do this.  We definitely have the tools necessary, but not
so much the man-power.  That said, please don't strip out rev_sha1 until
we've accounted for all its "data customers".

So, put another way, I think it's totally fine if we say ok everyone, from
date XYZ, you will no longer have rev_sha1 in the database, but if you want
to know whether an edit reverts a previous edit or a series of edits, go
*HERE*.  That's fine.  And just for context, here's how we do our revert
detection in Hadoop (it's pretty fancy) [2].


[1] https://github.com/mediawiki-utilities/python-mwreverts
[2]
https://github.com/wikimedia/analytics-refinery-source/blob/1d38b8e4acfd10dc811279826ffdff236e8b0f2d/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/denormalized/DenormalizedRevisionsBuilder.scala#L174-L317
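At its core, the sha1-based detection all of these consumers rely on is just
hash matching within a page's history; a toy sketch (far simpler than the
linked Scala job, which also windows by time and handles chains of reverts):

    def detect_reverts(revisions):
        """revisions: iterable of (rev_id, sha1) for one page, in order.
        Yields (reverting_rev_id, restored_rev_id) pairs."""
        first_seen = {}  # sha1 -> rev_id that first produced this content
        for rev_id, sha1 in revisions:
            if sha1 in first_seen:
                yield rev_id, first_seen[sha1]
            else:
                first_seen[sha1] = rev_id

    history = [(1, "aaa"), (2, "bbb"), (3, "aaa"), (4, "ccc")]
    print(list(detect_reverts(history)))  # [(3, 1)]: rev 3 restored rev 1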

On Mon, Sep 18, 2017 at 9:19 AM, Daniel Kinzler wrote:

> On 16.09.2017 at 01:22, Matthew Flaschen wrote:
> > On 09/15/2017 06:51 AM, Daniel Kinzler wrote:
> >> Also, I believe Roan is currently looking for a better mechanism for
> tracking
> >> all kinds of reverts directly.
> >
> > Let's see if we want to use rev_sha1 for that better solution (a way to
> > track reverts within MW itself) before we drop it.
>
>
> The problem is that if we don't drop it, we have to *introduce* it for the
> new content table for MCR. I'd like to avoid that.
>
> I guess we can define the field and just null it, but... well. I'd like to
> avoid
> that.
>
>
> --
> Daniel Kinzler
> Principal Platform Engineer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-18 Thread Daniel Kinzler
On 16.09.2017 at 01:22, Matthew Flaschen wrote:
> On 09/15/2017 06:51 AM, Daniel Kinzler wrote:
>> Also, I believe Roan is currently looking for a better mechanism for tracking
>> all kinds of reverts directly.
> 
> Let's see if we want to use rev_sha1 for that better solution (a way to track
> reverts within MW itself) before we drop it.


The problem is that if we don't drop it, we have to *introduce* it for the new
content table for MCR. I'd like to avoid that.

I guess we can define the field and just null it, but... well. I'd like to avoid
that.


-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-17 Thread MZMcBride
Antoine Musso wrote:
>I guess Aaron Halfaker, Brion Vibber, Aaron Schulz would have some
>insights about it.

Yes. Brion started a thread about the use of SHA-1 in February 2017:

https://lists.wikimedia.org/pipermail/wikitech-l/2017-February/087664.html
https://lists.wikimedia.org/pipermail/wikitech-l/2017-February/087666.html

Of note, we have .

The use of base-36 SHA-1 instead of base-16 SHA-1 for revision.rev_sha1
has always perplexed me. It'd be nice to better(?) document that design
decision. It's referenced here:
https://lists.wikimedia.org/pipermail/wikitech-l/2012-September/063445.html
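For reference, the base-36 form is just the same 160-bit value re-encoded,
which shrinks the field from 40 hex characters to at most 31. A sketch of the
conversion (MediaWiki's own helper, as I understand it, also zero-pads to a
fixed width, but treat the details here as illustrative):

    import hashlib

    def sha1_base36(content: bytes) -> str:
        n = int(hashlib.sha1(content).hexdigest(), 16)
        digits = "0123456789abcdefghijklmnopqrstuvwxyz"
        out = ""
        while n:
            n, r = divmod(n, 36)
            out = digits[r] + out
        return out.rjust(31, "0")  # 36**31 > 2**160, so 31 digits always suffice

    print(sha1_base36(b"example"))  # 31 characters instead of 40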

MZMcBride




Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-16 Thread Antoine Musso

On 15/09/2017 12:51, Daniel Kinzler wrote:


I'm working on the database schema for Multi-Content-Revisions (MCR)
<https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema> and I'd
like to get rid of the rev_sha1 field:

Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more
expensive with MCR. With multiple content objects per revision, we need to track
the hash for each slot, and then re-calculate the sha1 for each revision.



Hello,

That was introduced by Aaron Schulz. The purpose is to have them pre-computed,
since it is quite expensive to compute them on millions of rows.


A use case was to easily detect reverts.

See for reference:
https://phabricator.wikimedia.org/T23860
https://phabricator.wikimedia.org/T27312

I guess Aaron Halfaker, Brion Vibber, Aaron Schulz would have some 
insights about it.





Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Gergo Tisza
At a quick glance, EventBus and FlaggedRevs are the two extensions using
the hashes. EventBus just puts them into the emitted data; FlaggedRevs
detects reverts to the latest stable revision that way (so there is no
rev_sha1-based lookup in either case, although in the case of FlaggedRevs I
could imagine a use case for something like that).

Files on the other hand use hash lookups a lot, and AIUI they are planned
to become MCR slots eventually.

For a quick win, you could just reduce the hash size. We have around a
billion revisions, and probably won't ever have more than a trillion;
square that for birthday effect and add a couple extra zeros just to be
sure, and it still fits comfortably into 80 bits. If hashes only need to be
unique within the same page then maybe 30-40.
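The birthday arithmetic behind that estimate, spelled out (a sketch; n is a
deliberately generous revision count):

    # Birthday bound: with n items and a b-bit hash, the probability of
    # any collision is roughly n**2 / 2**(b + 1).
    n = 10**12   # a trillion revisions, far above today's ~1e9
    b = 80       # proposed truncated hash size in bits

    p = n**2 / 2**(b + 1)
    print(f"collision probability ~ {p:.2f}")  # ~0.41 at a trillion;
    # at today's ~1e9 revisions the same formula gives ~4e-7.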

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Matthew Flaschen

On 09/15/2017 06:51 AM, Daniel Kinzler wrote:

Also, I believe Roan is currently looking for a better mechanism for tracking
all kinds of reverts directly.


Let's see if we want to use rev_sha1 for that better solution (a way to 
track reverts within MW itself) before we drop it.


I know Roan is planning to write an RFC on reverts.

Matt


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
Ok, a little more detail here:

For MCR, we would have to keep around the hash of each content object ("slot")
AND of each revision. This makes the revision and content tables "wider", which
is a problem because they grow quite "tall", too. It also means we have to
compute a hash of hashes for each revision, but that's not horrible.

I'm hoping we can remove the hash from both tables. Keeping the hash of each
content object and/or each revision somewhere else is fine with me. Perhaps it's
sufficient to generate it when generating XML dumps. Maybe we want it in hadoop.
Maybe we want to have it in a separate SQL database. But perhaps we don't
actually need it.

Can someone explain *why* they want the hash at all?

On 15.09.2017 at 22:01, Stas Malyshev wrote:
> Hi!
> 
>> We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
>> from the little I know:
>>
>> Most analytical computations (for things like reverts, as you say) don’t
>> have easy access to content, so computing SHAs on the fly is pretty hard.
>> MediaWiki history reconstruction relies on the SHA to figure out what
>> revisions revert other revisions, as there is no reliable way to know if
>> something is a revert other than by comparing SHAs.
> 
> As a random idea - would it be possible to calculate the hashes when
> data is transitioned from SQL to Hadoop storage? I imagine that would
> slow down the transition, but not sure if it'd be substantial or not. If
> we're using the hash just to compare revisions, we could also use
> different hash (maybe non-crypto hash?) which may be faster.
> 


-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
A revert restores a previous revision. It covers all slots.

The fact that reverts, watching, protecting, etc. still work per page, while you
can have multiple kinds of different content on the page, is indeed the point of
MCR.

On 15.09.2017 at 22:23, C. Scott Ananian wrote:
> Alternatively, perhaps "hash" could be an optional part of an MCR chunk?
> We could keep it for the wikitext, but drop the hash for the metadata, and
> drop any support for a "combined" hash over wikitext + all-other-pieces.
> 
> ...which raises the question of how reverts work in MCR.  Is it just the
> wikitext which is reverted, or do categories and other metadata revert as
> well?  And perhaps we can just mark these at revert time instead of trying
> to reconstruct it after the fact?
>  --scott
> 
> On Fri, Sep 15, 2017 at 4:13 PM, Stas Malyshev wrote:
> 
>> Hi!
>>
>> On 9/15/17 1:06 PM, Andrew Otto wrote:
>>> As a random idea - would it be possible to calculate the hashes
>>> when data is transitioned from SQL to Hadoop storage?
>>>
>>> We take monthly snapshots of the entire history, so every month we’d
>>> have to pull the content of every revision ever made :o
>>
>> Why? If you have already seen that revision in a previous snapshot, you'd
>> already have its hash? Admittedly, I have no idea how the process works,
>> so I am just talking out of general knowledge and may miss some things.
>> Also, of course, you already have hashes for all revs up to the day we
>> decide to turn the hash off. Starting that day, hashes would have to be
>> generated, but I see no reason to generate one more than once?
>> --
>> Stas Malyshev
>> smalys...@wikimedia.org
>>


-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
On 15.09.2017 at 19:49, Erik Zachte wrote:
> Computing the hashes on the fly for the offline analysis doesn’t work for
> Wikistats 1.0, as it only parses the stub dumps, without article content,
> just metadata.
> Parsing the full archive dumps is quite expensive, time-wise.

We can always compute the hash when outputting XML dumps that contain the full
content (it's already loaded, so no big deal), and then generate the XML dump
with only meta-data from the full dump.
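That pipeline could look roughly like this (an illustrative sketch, not the
actual dumps code): hash each revision as its text streams through the
full-history export, and let the metadata-only stub dump reuse the recorded
hash.

    import hashlib

    def export_dumps(revisions, full_out, stub_out):
        """revisions: iterable of (rev_id, text). Writes a toy full dump
        and a stub (metadata-only) dump that still carries the sha1."""
        for rev_id, text in revisions:
            sha1 = hashlib.sha1(text.encode("utf-8")).hexdigest()
            full_out.write(f"{rev_id}\t{sha1}\t{text!r}\n")  # content in hand anyway
            stub_out.write(f"{rev_id}\t{sha1}\n")            # no content needed here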


-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Chad
We could keep it in the XML dumps (it's part of the XSD after all)... just
compute it at export time. Not terribly hard, I don't think; we should have
the parsed content already on hand.

-Chad

On Fri, Sep 15, 2017 at 12:51 PM James Hare wrote:

> What I wonder is – does this *need* to be a part of the database table, or
> can it be a dataset generated from each revision and then published
> separately? This way each user wouldn’t have to individually compute the
> hashes while we also get the (ostensible) benefit of getting them out of
> the table.
>
> On September 15, 2017 at 12:41:03 PM, Andrew Otto (o...@wikimedia.org)
> wrote:
>
> We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
> from the little I know:
>
> Most analytical computations (for things like reverts, as you say) don’t
> have easy access to content, so computing SHAs on the fly is pretty hard.
> MediaWiki history reconstruction relies on the SHA to figure out what
> revisions revert other revisions, as there is no reliable way to know if
> something is a revert other than by comparing SHAs.
>
> See
>
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
> (particularly the *revert* fields).
>
>
>
> On Fri, Sep 15, 2017 at 1:49 PM, Erik Zachte wrote:
>
> > Computing the hashes on the fly for the offline analysis doesn’t work for
> > Wikistats 1.0, as it only parses the stub dumps, without article content,
> > just metadata.
> > Parsing the full archive dumps is quite expensive, time-wise.
> >
> > This may change with Wikistats 2.0, which has a totally different process
> > flow. That I can't tell.
> >
> > Erik Zachte
> >
> > -Original Message-
> > From: Wikitech-l [mailto:wikitech-l-boun...@lists.wikimedia.org] On
> > Behalf Of Daniel Kinzler
> > Sent: Friday, September 15, 2017 12:52
> > To: Wikimedia developers 
> > Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
> >
> > Hi all!
> >
> > I'm working on the database schema for Multi-Content-Revisions (MCR) <
> > https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema>
> > and I'd like to get rid of the rev_sha1 field:
> >
> > Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes
> > more expensive with MCR. With multiple content objects per revision, we
> > need to track the hash for each slot, and then re-calculate the sha1 for
> > each revision.
> >
> > That's expensive especially in terms of bytes-per-database-row, which
> > impacts query performance.
> >
> > So, what do we need the rev_sha1 field for? As far as I know, nothing in
> > core uses it, and I'm not aware of any extension using it either. It
> seems
> > to be used primarily in offline analysis for detecting (manual) reverts
> by
> > looking for revisions with the same hash.
> >
> > Is that reason enough for dragging all the hashes around the database with
> > every revision update? Or can we just compute the hashes on the fly for the
> > offline analysis? Computing hashes is slow since the content needs to be
> > loaded first, but it would only have to be done for pairs of revisions of
> > the same page with the same size, which should be a pretty good
> > optimization.
> >
> > Also, I believe Roan is currently looking for a better mechanism for
> > tracking all kinds of reverts directly.
> >
> > So, can we drop rev_sha1?
> >
> > --
> > Daniel Kinzler
> > Principal Platform Engineer
> >
> > Wikimedia Deutschland
> > Gesellschaft zur Förderung Freien Wissens e.V.
> >

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread C. Scott Ananian
Alternatively, perhaps "hash" could be an optional part of an MCR chunk?
We could keep it for the wikitext, but drop the hash for the metadata, and
drop any support for a "combined" hash over wikitext + all-other-pieces.

...which raises the question of how reverts work in MCR.  Is it just the
wikitext which is reverted, or do categories and other metadata revert as
well?  And perhaps we can just mark these at revert time instead of trying
to reconstruct it after the fact?
 --scott

On Fri, Sep 15, 2017 at 4:13 PM, Stas Malyshev wrote:

> Hi!
>
> On 9/15/17 1:06 PM, Andrew Otto wrote:
> >> As a random idea - would it be possible to calculate the hashes
> >> when data is transitioned from SQL to Hadoop storage?
> >
> > We take monthly snapshots of the entire history, so every month we’d
> > have to pull the content of every revision ever made :o
>
> Why? If you have already seen that revision in a previous snapshot, you'd
> already have its hash? Admittedly, I have no idea how the process works,
> so I am just talking out of general knowledge and may miss some things.
> Also, of course, you already have hashes for all revs up to the day we
> decide to turn the hash off. Starting that day, hashes would have to be
> generated, but I see no reason to generate one more than once?
> --
> Stas Malyshev
> smalys...@wikimedia.org
>



-- 
(http://cscott.net)

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Stas Malyshev
Hi!

On 9/15/17 1:06 PM, Andrew Otto wrote:
>> As a random idea - would it be possible to calculate the hashes
>> when data is transitioned from SQL to Hadoop storage?
> 
> We take monthly snapshots of the entire history, so every month we’d
> have to pull the content of every revision ever made :o

Why? If you have already seen that revision in a previous snapshot, you'd
already have its hash? Admittedly, I have no idea how the process works,
so I am just talking out of general knowledge and may miss some things.
Also, of course, you already have hashes for all revs up to the day we
decide to turn the hash off. Starting that day, hashes would have to be
generated, but I see no reason to generate one more than once?
-- 
Stas Malyshev
smalys...@wikimedia.org


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Andrew Otto
> can it be a dataset generated from each revision and then published
> separately?

Perhaps it could be generated asynchronously via a job? Either stored in
the revision table or in a separate table.

On Fri, Sep 15, 2017 at 4:06 PM, Andrew Otto wrote:

> > As a random idea - would it be possible to calculate the hashes when data
> > is transitioned from SQL to Hadoop storage?
>
> We take monthly snapshots of the entire history, so every month we’d have
> to pull the content of every revision ever made :o
>
>
> On Fri, Sep 15, 2017 at 4:01 PM, Stas Malyshev 
> wrote:
>
>> Hi!
>>
>> > We should hear from Joseph, Dan, Marcel, and Aaron H on this I think,
>> but
>> > from the little I know:
>> >
>> > Most analytical computations (for things like reverts, as you say) don’t
>> > have easy access to content, so computing SHAs on the fly is pretty
>> hard.
>> > MediaWiki history reconstruction relies on the SHA to figure out what
>> > revisions revert other revisions, as there is no reliable way to know if
>> > something is a revert other than by comparing SHAs.
>>
>> As a random idea - would it be possible to calculate the hashes when
>> data is transitioned from SQL to Hadoop storage? I imagine that would
>> slow down the transition, but not sure if it'd be substantial or not. If
>> we're using the hash just to compare revisions, we could also use
>> different hash (maybe non-crypto hash?) which may be faster.
>>
>> --
>> Stas Malyshev
>> smalys...@wikimedia.org
>>
>
>

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Andrew Otto
> As a random idea - would it be possible to calculate the hashes when data
> is transitioned from SQL to Hadoop storage?

We take monthly snapshots of the entire history, so every month we’d have
to pull the content of every revision ever made :o


On Fri, Sep 15, 2017 at 4:01 PM, Stas Malyshev wrote:

> Hi!
>
> > We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
> > from the little I know:
> >
> > Most analytical computations (for things like reverts, as you say) don’t
> > have easy access to content, so computing SHAs on the fly is pretty hard.
> > MediaWiki history reconstruction relies on the SHA to figure out what
> > revisions revert other revisions, as there is no reliable way to know if
> > something is a revert other than by comparing SHAs.
>
> As a random idea - would it be possible to calculate the hashes when
> data is transitioned from SQL to Hadoop storage? I imagine that would
> slow down the transition, but not sure if it'd be substantial or not. If
> we're using the hash just to compare revisions, we could also use
> different hash (maybe non-crypto hash?) which may be faster.
>
> --
> Stas Malyshev
> smalys...@wikimedia.org
>

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Stas Malyshev
Hi!

> We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
> from the little I know:
> 
> Most analytical computations (for things like reverts, as you say) don’t
> have easy access to content, so computing SHAs on the fly is pretty hard.
> MediaWiki history reconstruction relies on the SHA to figure out what
> revisions revert other revisions, as there is no reliable way to know if
> something is a revert other than by comparing SHAs.

As a random idea - would it be possible to calculate the hashes when
data is transitioned from SQL to Hadoop storage? I imagine that would
slow down the transition, but not sure if it'd be substantial or not. If
we're using the hash just to compare revisions, we could also use
different hash (maybe non-crypto hash?) which may be faster.
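A sketch of that incremental idea (hypothetical snapshot layout): join the new
snapshot against the previous one, and fetch content only for revisions that
have no recorded hash yet.

    import hashlib

    def hash_snapshot(rev_ids, previous_hashes, load_content):
        """rev_ids: all revisions in this month's snapshot.
        previous_hashes: {rev_id: sha1} carried over from last month.
        load_content: callable returning a revision's bytes (expensive)."""
        hashes = {}
        for rev_id in rev_ids:
            if rev_id in previous_hashes:
                hashes[rev_id] = previous_hashes[rev_id]  # no content fetch
            else:
                content = load_content(rev_id)
                hashes[rev_id] = hashlib.sha1(content).hexdigest()
        return hashes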

-- 
Stas Malyshev
smalys...@wikimedia.org


Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread James Hare
What I wonder is – does this *need* to be a part of the database table, or
can it be a dataset generated from each revision and then published
separately? This way each user wouldn’t have to individually compute the
hashes while we also get the (ostensible) benefit of getting them out of
the table.

On September 15, 2017 at 12:41:03 PM, Andrew Otto (o...@wikimedia.org)
wrote:

We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
from the little I know:

Most analytical computations (for things like reverts, as you say) don’t
have easy access to content, so computing SHAs on the fly is pretty hard.
MediaWiki history reconstruction relies on the SHA to figure out what
revisions revert other revisions, as there is no reliable way to know if
something is a revert other than by comparing SHAs.

See
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
(particularly the *revert* fields).



On Fri, Sep 15, 2017 at 1:49 PM, Erik Zachte wrote:

> Computing the hashes on the fly for the offline analysis doesn’t work for
> Wikistats 1.0, as it only parses the stub dumps, without article content,
> just metadata.
> Parsing the full archive dumps is quite expensive, time-wise.
>
> This may change with Wikistats 2.0, which has a totally different process
> flow. That I can't tell.
>
> Erik Zachte
>
> -Original Message-
> From: Wikitech-l [mailto:wikitech-l-boun...@lists.wikimedia.org] On
> Behalf Of Daniel Kinzler
> Sent: Friday, September 15, 2017 12:52
> To: Wikimedia developers 
> Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
>
> Hi all!
>
> I'm working on the database schema for Multi-Content-Revisions (MCR) <
> https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema>
> and I'd like to get rid of the rev_sha1 field:
>
> Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes
> more expensive with MCR. With multiple content objects per revision, we
> each revision.
>
> That's expensive especially in terms of bytes-per-database-row, which
> impacts query performance.
>
> So, what do we need the rev_sha1 field for? As far as I know, nothing in
> core uses it, and I'm not aware of any extension using it either. It seems
> to be used primarily in offline analysis for detecting (manual) reverts by
> looking for revisions with the same hash.
>
> Is that reason enough for dragging all the hashes around the database with
> every revision update? Or can we just compute the hashes on the fly for the
> offline analysis? Computing hashes is slow since the content needs to be
> loaded first, but it would only have to be done for pairs of revisions of
> the same page with the same size, which should be a pretty good
> optimization.
>
> Also, I believe Roan is currently looking for a better mechanism for
> tracking all kinds of reverts directly.
>
> So, can we drop rev_sha1?
>
> --
> Daniel Kinzler
> Principal Platform Engineer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Andrew Otto
We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
from the little I know:

Most analytical computations (for things like reverts, as you say) don’t
have easy access to content, so computing SHAs on the fly is pretty hard.
MediaWiki history reconstruction relies on the SHA to figure out what
revisions revert other revisions, as there is no reliable way to know if
something is a revert other than by comparing SHAs.

See
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
(particularly the *revert* fields).



On Fri, Sep 15, 2017 at 1:49 PM, Erik Zachte wrote:

> Computing the hashes on the fly for the offline analysis doesn’t work for
> Wikistats 1.0, as it only parses the stub dumps, without article content,
> just metadata.
> Parsing the full archive dumps is quite expensive, time-wise.
>
> This may change with Wikistats 2.0, which has a totally different process
> flow. That I can't tell.
>
> Erik Zachte
>
> -Original Message-
> From: Wikitech-l [mailto:wikitech-l-boun...@lists.wikimedia.org] On
> Behalf Of Daniel Kinzler
> Sent: Friday, September 15, 2017 12:52
> To: Wikimedia developers 
> Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
>
> Hi all!
>
> I'm working on the database schema for Multi-Content-Revisions (MCR) <
> https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema>
> and I'd like to get rid of the rev_sha1 field:
>
> Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes
> more expensive with MCR. With multiple content objects per revision, we
> need to track the hash for each slot, and then re-calculate the sha1 for
> each revision.
>
> That's expensive especially in terms of bytes-per-database-row, which
> impacts query performance.
>
> So, what do we need the rev_sha1 field for? As far as I know, nothing in
> core uses it, and I'm not aware of any extension using it either. It seems
> to be used primarily in offline analysis for detecting (manual) reverts by
> looking for revisions with the same hash.
>
> Is that reason enough for dragging all the hashes around the database with
> every revision update? Or can we just compute the hashes on the fly for the
> offline analysis? Computing hashes is slow since the content needs to be
> loaded first, but it would only have to be done for pairs of revisions of
> the same page with the same size, which should be a pretty good
> optimization.
>
> Also, I believe Roan is currently looking for a better mechanism for
> tracking all kinds of reverts directly.
>
> So, can we drop rev_sha1?
>
> --
> Daniel Kinzler
> Principal Platform Engineer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Erik Zachte
Computing the hashes on the fly for the offline analysis doesn’t work for
Wikistats 1.0, as it only parses the stub dumps, without article content, just
metadata.
Parsing the full archive dumps is quite expensive, time-wise.

This may change with Wikistats 2.0, which has a totally different process flow.
That I can't tell.

Erik Zachte

-Original Message-
From: Wikitech-l [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of 
Daniel Kinzler
Sent: Friday, September 15, 2017 12:52
To: Wikimedia developers 
Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

Hi all!

I'm working on the database schema for Multi-Content-Revisions (MCR) 
<https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema> and 
I'd like to get rid of the rev_sha1 field:

Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more 
expensive with MCR. With multiple content objects per revision, we need to 
track the hash for each slot, and then re-calculate the sha1 for each revision.

That's expensive especially in terms of bytes-per-database-row, which impacts 
query performance.

So, what do we need the rev_sha1 field for? As far as I know, nothing in core 
uses it, and I'm not aware of any extension using it either. It seems to be 
used primarily in offline analysis for detecting (manual) reverts by looking 
for revisions with the same hash.

Is that reason enough for dragging all the hashes around the database with 
every revision update? Or can we just compute the hashes on the fly for the 
offline analysis? Computing hashes is slow since the content needs to be loaded 
first, but it would only have to be done for pairs of revisions of the same 
page with the same size, which should be a pretty good optimization.

Also, I believe Roan is currently looking for a better mechanism for tracking 
all kinds of reverts directly.

So, can we drop rev_sha1?

--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


[Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
Hi all!

I'm working on the database schema for Multi-Content-Revisions (MCR)
<https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema> and I'd
like to get rid of the rev_sha1 field:

Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more
expensive with MCR. With multiple content objects per revision, we need to track
the hash for each slot, and then re-calculate the sha1 for each revision.

That's expensive especially in terms of bytes-per-database-row, which impacts
query performance.

So, what do we need the rev_sha1 field for? As far as I know, nothing in core
uses it, and I'm not aware of any extension using it either. It seems to be used
primarily in offline analysis for detecting (manual) reverts by looking for
revisions with the same hash.

Is that reason enough for dragging all the hashes around the database with every
revision update? Or can we just compute the hashes on the fly for the offline
analysis? Computing hashes is slow since the content needs to be loaded first,
but it would only have to be done for pairs of revisions of the same page with
the same size, which should be a pretty good optimization.
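Concretely, the optimization could look like this (illustrative sketch): only
revisions of the same page with the same byte size can possibly have identical
content, so content loading is confined to those candidate groups.

    import hashlib
    from collections import defaultdict

    def revert_candidates(revisions, load_content):
        """revisions: iterable of (page_id, rev_id, byte_size).
        Yields (page_id, [rev_ids with identical content])."""
        groups = defaultdict(list)
        for page_id, rev_id, size in revisions:
            groups[(page_id, size)].append(rev_id)

        for (page_id, _size), rev_ids in groups.items():
            if len(rev_ids) < 2:
                continue  # unique size on this page: cannot be a revert
            by_hash = defaultdict(list)
            for rev_id in rev_ids:  # only these few need their content loaded
                sha1 = hashlib.sha1(load_content(rev_id)).hexdigest()
                by_hash[sha1].append(rev_id)
            for matching in by_hash.values():
                if len(matching) > 1:
                    yield page_id, matching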

Also, I believe Roan is currently looking for a better mechanism for tracking
all kinds of reverts directly.

So, can we drop rev_sha1?

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
