Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-12-07 Thread Daniel Kinzler
On 06.12.2017 at 22:09, John Erling Blad wrote: > What is the current state, will some kind of digest be retained? The current plan is to keep the SHA1 hash, one for each slot, and an aggregate one for the revision. If there is only one slot, the revision hash is the same as the slot hash. -- D
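
A minimal Python sketch of the plan described above: one SHA1 per slot, plus an aggregate for the revision that collapses to the slot hash when there is only one slot. The function names, the hex encoding, and the way the slot hashes are combined (concatenated in sorted role order, then hashed again) are assumptions made for illustration; the thread does not say how the aggregate is computed.

```python
import hashlib


def slot_sha1(content: bytes) -> str:
    """SHA1 of a single slot's content, as a hex string (encoding is an assumption)."""
    return hashlib.sha1(content).hexdigest()


def revision_sha1(slots: dict[str, bytes]) -> str:
    """Aggregate hash for a revision, given a mapping of slot role -> content.

    Hypothetical combination rule: with one slot, reuse the slot hash;
    otherwise hash the slot hashes concatenated in sorted role order.
    """
    if not slots:
        raise ValueError("a revision needs at least one slot")
    hashes = {role: slot_sha1(content) for role, content in slots.items()}
    if len(hashes) == 1:
        # Single-slot revision: the revision hash is the slot hash.
        return next(iter(hashes.values()))
    combined = "".join(hashes[role] for role in sorted(hashes))
    return hashlib.sha1(combined.encode("ascii")).hexdigest()
```

Sorting by role name keeps the aggregate independent of the order in which the slots happen to be supplied.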

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-12-06 Thread John Erling Blad
What is the current state, will some kind of digest be retained? On Thu, Sep 21, 2017 at 9:56 PM, Gergo Tisza wrote: > On Thu, Sep 21, 2017 at 6:10 AM, Daniel Kinzler < > daniel.kinz...@wikimedia.de > > wrote: > > > Yes, we could put it into a separate table. But that table would be > > exactly

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-21 Thread Gergo Tisza
On Thu, Sep 21, 2017 at 6:10 AM, Daniel Kinzler wrote: > Yes, we could put it into a separate table. But that table would be > exactly as > tall as the content table, and would be keyed to it. I see no advantage. The advantage is that MediaWiki would almost never need to use the hash table. It

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-21 Thread Daniel Kinzler
On 21.09.2017 at 17:18, Federico Leva wrote: > (Offlist) > > Daniel Kinzler, 21/09/2017 17:24: >> Hashing is a lot faster than loading the content. Since Special:Export needs >> to >> load the content anyway, the extra cost of hashing is negligible. > > I trust you, but really? Even when export

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-21 Thread Daniel Kinzler
On 21.09.2017 at 11:24, Federico Leva (Nemo) wrote: > The revision hashes are also supposed to be used by at least some of the > import > tools for XML dumps. The dumps would be less valuable without some way to > check > their content. While this is a typical use case for hashes in theory, i

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-21 Thread Daniel Kinzler
On 19.09.2017 at 20:48, Gergo Tisza wrote: > On Tue, Sep 19, 2017 at 6:42 AM, Daniel Kinzler Can't you just split it into a separate table? Core would only need to > touch it on insert/update, so that should resolve the performance concerns. Yes, we could put it into a separate table. But that t

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-21 Thread Federico Leva (Nemo)
The revision hashes are also supposed to be used by at least some of the import tools for XML dumps. The dumps would be less valuable without some way to check their content. Generating hashes on the fly is surely not an option given exports can also need to happen within the time of a PHP requ

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-19 Thread Gergo Tisza
On Tue, Sep 19, 2017 at 6:42 AM, Daniel Kinzler wrote: > That table will be tall, and the sha1 is the (on average) largest field. > If we > are going to use a different mechanism for tracking reverts soon, my hope > was > that we can do without it. > Can't you just split it into a separate table

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-19 Thread John Erling Blad
There are two important use cases; one where you want to identify previous reverts, and one where you want to identify close matches. There are other ways to do the first than to use a digest, but the digest opens up for alternate client side algorithms. The last would typically be done by some loc

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-19 Thread Daniel Kinzler
On 19.09.2017 at 10:15, Jaime Crespo wrote: > I am not a mediawiki developer, but shouldn't sha1 be moved instead of > deleted/not deleted? Moved to the content table- so it is kept > unaltered. The background of my original mail is indeed the question whether we need the sha1 field in the content

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-19 Thread Jaime Crespo
il -- > From: Dan Andreescu > To: Wikimedia developers > Date: 18. 9. 2017 16:26:18 > Subject: Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)? > "So, as things stand, rev_sha1 in the database is used for: > > 1. the XML dumps process and all the researche

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-18 Thread Danny B.
-- Original e-mail -- From: Dan Andreescu To: Wikimedia developers Date: 18. 9. 2017 16:26:18 Subject: Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)? "So, as things stand, rev_sha1 in the database is used for: 1. the XML dumps process and all the resear

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-18 Thread Dan Andreescu
So, as things stand, rev_sha1 in the database is used for: 1. the XML dumps process and all the researchers depending on the XML dumps (probably just for revert detection) 2. revert detection for libraries like python-mwreverts [1] 3. revert detection in mediawiki history reconstruction processes
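
To illustrate why the uses listed above depend on a stored hash, here is a simplified Python sketch of hash-based "identity revert" detection over one page's history. It is not the python-mwreverts API; the function name and the input format (chronological `(rev_id, sha1)` pairs) are hypothetical, and real tools typically also limit how far back they look.

```python
def find_identity_reverts(revisions):
    """Return (reverting_rev_id, restored_rev_id) pairs for one page.

    `revisions` is a chronological iterable of (rev_id, sha1) tuples.
    A revision counts as a revert when its hash matches an earlier
    revision of the page and differs from its immediate predecessor.
    """
    last_seen = {}   # sha1 -> most recent rev_id that had this content
    reverts = []
    prev_sha1 = None
    for rev_id, sha1 in revisions:
        if sha1 in last_seen and sha1 != prev_sha1:
            reverts.append((rev_id, last_seen[sha1]))
        last_seen[sha1] = rev_id
        prev_sha1 = sha1
    return reverts
```

Only the hashes are needed here, never the revision text, which is why analyses that lack easy access to content rely on rev_sha1 being available.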

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-18 Thread Daniel Kinzler
On 16.09.2017 at 01:22, Matthew Flaschen wrote: > On 09/15/2017 06:51 AM, Daniel Kinzler wrote: >> Also, I believe Roan is currently looking for a better mechanism for tracking >> all kinds of reverts directly. > > Let's see if we want to use rev_sha1 for that better solution (a way to track > re

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-17 Thread MZMcBride
Antoine Musso wrote: >I guess Aaron Halfaker, Brion Vibber, Aaron Schulz would have some >insights about it. Yes. Brion started a thread about the use of SHA-1 in February 2017: https://lists.wikimedia.org/pipermail/wikitech-l/2017-February/087664.html https://lists.wikimedia.org/pipermail/wikite

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-16 Thread Antoine Musso
On 15/09/2017 12:51, Daniel Kinzler wrote: I'm working on the database schema for Multi-Content-Revisions (MCR) and I'd like to get rid of the rev_sha1 field: Maintaining revision hashes (the rev_sha1 field) is expensive

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Gergo Tisza
At a quick glance, EventBus and FlaggedRevs are the two extensions using the hashes. EventBus just puts them into the emitted data; FlaggedRevs detects reverts to the latest stable revision that way (so there is no rev_sha1 based lookup in either case, although in the case of FlaggedRevs I could i

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Matthew Flaschen
On 09/15/2017 06:51 AM, Daniel Kinzler wrote: Also, I believe Roan is currently looking for a better mechanism for tracking all kinds of reverts directly. Let's see if we want to use rev_sha1 for that better solution (a way to track reverts within MW itself) before we drop it. I know Roan is

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
Ok, a little more detail here: For MCR, we would have to keep around the hash of each content object ("slot") AND of each revision. This makes the revision and content tables "wider", which is a problem because they grow quite "tall", too. It also means we have to compute a hash of hashes for each

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
A revert restores a previous revision. It covers all slots. The fact that reverts, watching, protecting, etc. still work per page, while you can have multiple kinds of different content on the page, is indeed the point of MCR. On 15.09.2017 at 22:23, C. Scott Ananian wrote: > Alternatively, perh

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
On 15.09.2017 at 19:49, Erik Zachte wrote: > Compute the hashes on the fly for the offline analysis doesn’t work for > Wikistats 1.0, as it only parses the stub dumps, without article content, > just metadata. > Parsing the full archive dumps is a quite expensive, time-wise. We can always compu

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Chad
gt; > > Erik Zachte > > > > -Original Message- > > From: Wikitech-l [mailto:wikitech-l-boun...@lists.wikimedia.org] On > > Behalf Of Daniel Kinzler > > Sent: Friday, September 15, 2017 12:52 > > To: Wikimedia developers > > Subject: [Wikitech-l

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread C. Scott Ananian
Alternatively, perhaps "hash" could be an optional part of an MCR chunk? We could keep it for the wikitext, but drop the hash for the metadata, and drop any support for a "combined" hash over wikitext + all-other-pieces. ...which begs the question about how reverts work in MCR. Is it just the wik

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Stas Malyshev
Hi! On 9/15/17 1:06 PM, Andrew Otto wrote: >> As a random idea - would it be possible to calculate the hashes > when data is transitioned from SQL to Hadoop storage? > > We take monthly snapshots of the entire history, so every month we’d > have to pull the content of every revision ever made :o

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Andrew Otto
> can it be a dataset generated from each revision and then published separately? Perhaps it be generated asynchronously via a job? Either stored in revision or a separate table. On Fri, Sep 15, 2017 at 4:06 PM, Andrew Otto wrote: > > As a random idea - would it be possible to calculate the ha

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Andrew Otto
> As a random idea - would it be possible to calculate the hashes when data is transitioned from SQL to Hadoop storage? We take monthly snapshots of the entire history, so every month we’d have to pull the content of every revision ever made :o On Fri, Sep 15, 2017 at 4:01 PM, Stas Malyshev wro

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Stas Malyshev
Hi! > We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but > from the little I know: > > Most analytical computations (for things like reverts, as you say) don’t > have easy access to content, so computing SHAs on the fly is pretty hard. > MediaWiki history reconstruction rel

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread James Hare
> Behalf Of Daniel Kinzler > Sent: Friday, September 15, 2017 12:52 > To: Wikimedia developers > Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)? > > Hi all! > > I'm working on the database schema for Multi-Content-Revisions (MCR) < > https://www.mediawiki

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Andrew Otto
Wikistats 2.0 with has a totally different process > flow. That I can't tell. > > Erik Zachte > > -Original Message- > From: Wikitech-l [mailto:wikitech-l-boun...@lists.wikimedia.org] On > Behalf Of Daniel Kinzler > Sent: Friday, September 15, 2017 12:52 > To: Wi

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Erik Zachte
process flow. That I can't tell. Erik Zachte -Original Message- From: Wikitech-l [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Daniel Kinzler Sent: Friday, September 15, 2017 12:52 To: Wikimedia developers Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)? H

[Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
Hi all! I'm working on the database schema for Multi-Content-Revisions (MCR) and I'd like to get rid of the rev_sha1 field: Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more expensive with MCR.