We could keep it in the XML dumps (it's part of the XSD after all)...just
compute it at export time. Not terribly hard, I don't think, we should have
the parsed content already on hand....

-Chad

On Fri, Sep 15, 2017 at 12:51 PM James Hare <jamesmh...@gmail.com> wrote:

> What I wonder is – does this *need* to be a part of the database table, or
> can it be a dataset generated from each revision and then published
> separately? This way each user wouldn’t have to individually compute the
> hashes while we also get the (ostensible) benefit of getting them out of
> the table.
>
> On September 15, 2017 at 12:41:03 PM, Andrew Otto (o...@wikimedia.org)
> wrote:
>
> We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
> from the little I know:
>
> Most analytical computations (for things like reverts, as you say) don’t
> have easy access to content, so computing SHAs on the fly is pretty hard.
> MediaWiki history reconstruction relies on the SHA to figure out what
> revisions revert other revisions, as there is no reliable way to know if
> something is a revert other than by comparing SHAs.
>
> See
>
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
> (particularly the *revert* fields).
>
>
>
> On Fri, Sep 15, 2017 at 1:49 PM, Erik Zachte <ezac...@wikimedia.org>
> wrote:
>
> > Computing the hashes on the fly for the offline analysis doesn’t work for
> > Wikistats 1.0, as it only parses the stub dumps, which contain just
> > metadata, without article content.
> > Parsing the full archive dumps is quite expensive, time-wise.
> >
> > This may change with Wikistats 2.0, which has a totally different process
> > flow. That I can't tell.
> >
> > Erik Zachte
> >
> > -----Original Message-----
> > From: Wikitech-l [mailto:wikitech-l-boun...@lists.wikimedia.org] On
> > Behalf Of Daniel Kinzler
> > Sent: Friday, September 15, 2017 12:52
> > To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> > Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
> >
> > Hi all!
> >
> > I'm working on the database schema for Multi-Content-Revisions (MCR) <
> > https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema>
> > and I'd like to get rid of the rev_sha1 field:
> >
> > Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes
> > more expensive with MCR. With multiple content objects per revision, we
> > need to track the hash for each slot, and then re-calculate the sha1 for
> > each revision.
> >
> > That's expensive especially in terms of bytes-per-database-row, which
> > impacts query performance.
> >
> > So, what do we need the rev_sha1 field for? As far as I know, nothing in
> > core uses it, and I'm not aware of any extension using it either. It seems
> > to be used primarily in offline analysis for detecting (manual) reverts by
> > looking for revisions with the same hash.
> >
> > Is that reason enough for dragging all the hashes around the database with
> > every revision update? Or can we just compute the hashes on the fly for the
> > offline analysis? Computing hashes is slow since the content needs to be
> > loaded first, but it would only have to be done for pairs of revisions of
> > the same page with the same size, which should be a pretty good
> > optimization.
> >
> > Also, I believe Roan is currently looking for a better mechanism for
> > tracking all kinds of reverts directly.
> >
> > So, can we drop rev_sha1?
> >
> > --
> > Daniel Kinzler
> > Principal Platform Engineer
> >
> > Wikimedia Deutschland
> > Gesellschaft zur Förderung Freien Wissens e.V.
> >
> > _______________________________________________
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
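
For readers following the thread: the optimization Daniel describes (hash only
revisions of the same page that share a size) could look roughly like the
sketch below. This is an illustration under assumptions, not code from
MediaWiki or from any analytics pipeline. The function names, the
(rev_id, page_id, size) tuple layout, and the load_content callback are all
hypothetical, and the sketch uses hex SHA-1 digests for simplicity rather than
whatever encoding rev_sha1 uses in the database.

```python
import hashlib
from collections import defaultdict


def sha1_hex(text):
    """Hex SHA-1 of the UTF-8 encoded text (encoding choice is an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_identity_reverts(revisions, load_content):
    """Find revisions whose content is identical to an earlier revision.

    revisions: iterable of (rev_id, page_id, size), oldest first (hypothetical layout).
    load_content: callable rev_id -> str; stands in for the expensive content load.
    Returns a sorted list of (later_rev_id, earlier_rev_id) pairs.
    """
    # Group by (page, size): only revisions in the same group can be identical.
    groups = defaultdict(list)
    for rev_id, page_id, size in revisions:
        groups[(page_id, size)].append(rev_id)

    matches = []
    for rev_ids in groups.values():
        if len(rev_ids) < 2:
            continue  # size is unique on this page, so no content load is needed
        first_seen = {}  # digest -> earliest rev_id with that content
        for rev_id in rev_ids:
            digest = sha1_hex(load_content(rev_id))
            if digest in first_seen:
                matches.append((rev_id, first_seen[digest]))
            else:
                first_seen[digest] = rev_id
    return sorted(matches)


# Toy data: revision 3 restores the text of revision 1 on page 10.
contents = {1: "foo", 2: "foo bar", 3: "foo", 4: "baz"}
revs = [(rid, 10, len(text.encode("utf-8"))) for rid, text in contents.items()]
loaded = []  # records which revisions actually had their content loaded


def load(rev_id):
    loaded.append(rev_id)
    return contents[rev_id]


result = find_identity_reverts(revs, load)
print(result)
print(loaded)
```

In the toy data, revision 2 has a size no other revision shares, so its
content is never loaded; revisions 1, 3 and 4 share a size and are hashed.
Note that this only finds exact content matches. As Andrew and Erik point out,
it still needs access to revision content, which stub dumps and most
analytical pipelines do not have.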