We could keep it in the XML dumps (it's part of the XSD, after all) and just compute it at export time. That shouldn't be terribly hard, I don't think; we should already have the parsed content on hand.
-Chad

On Fri, Sep 15, 2017 at 12:51 PM James Hare <jamesmh...@gmail.com> wrote:

> What I wonder is – does this *need* to be a part of the database table, or
> can it be a dataset generated from each revision and then published
> separately? This way each user wouldn’t have to individually compute the
> hashes while we also get the (ostensible) benefit of getting them out of
> the table.
>
> On September 15, 2017 at 12:41:03 PM, Andrew Otto (o...@wikimedia.org)
> wrote:
>
> We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
> from the little I know:
>
> Most analytical computations (for things like reverts, as you say) don’t
> have easy access to content, so computing SHAs on the fly is pretty hard.
> MediaWiki history reconstruction relies on the SHA to figure out what
> revisions revert other revisions, as there is no reliable way to know if
> something is a revert other than by comparing SHAs.
>
> See
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
> (particularly the *revert* fields).
>
> On Fri, Sep 15, 2017 at 1:49 PM, Erik Zachte <ezac...@wikimedia.org>
> wrote:
>
> > Computing the hashes on the fly for the offline analysis doesn’t work for
> > Wikistats 1.0, as it only parses the stub dumps, without article content,
> > just metadata.
> > Parsing the full archive dumps is quite expensive, time-wise.
> >
> > This may change with Wikistats 2.0, which has a totally different process
> > flow. That I can't tell.
> >
> > Erik Zachte
> >
> > -----Original Message-----
> > From: Wikitech-l [mailto:wikitech-l-boun...@lists.wikimedia.org] On
> > Behalf Of Daniel Kinzler
> > Sent: Friday, September 15, 2017 12:52
> > To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> > Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
> >
> > Hi all!
> >
> > I'm working on the database schema for Multi-Content-Revisions (MCR)
> > <https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema>
> > and I'd like to get rid of the rev_sha1 field:
> >
> > Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes
> > more expensive with MCR. With multiple content objects per revision, we
> > need to track the hash for each slot, and then re-calculate the sha1 for
> > each revision.
> >
> > That's expensive especially in terms of bytes-per-database-row, which
> > impacts query performance.
> >
> > So, what do we need the rev_sha1 field for? As far as I know, nothing in
> > core uses it, and I'm not aware of any extension using it either. It seems
> > to be used primarily in offline analysis for detecting (manual) reverts by
> > looking for revisions with the same hash.
> >
> > Is that reason enough for dragging all the hashes around the database with
> > every revision update? Or can we just compute the hashes on the fly for the
> > offline analysis? Computing hashes is slow since the content needs to be
> > loaded first, but it would only have to be done for pairs of revisions of
> > the same page with the same size, which should be a pretty good
> > optimization.
> >
> > Also, I believe Roan is currently looking for a better mechanism for
> > tracking all kinds of reverts directly.
> >
> > So, can we drop rev_sha1?
> >
> > --
> > Daniel Kinzler
> > Principal Platform Engineer
> >
> > Wikimedia Deutschland
> > Gesellschaft zur Förderung Freien Wissens e.V.
> >
> > _______________________________________________
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
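
For reference, the optimization Daniel describes in the thread (only hash revisions of the same page that have the same size) could look roughly like the sketch below. This is an illustration under assumptions, not how MediaWiki or the analytics pipeline does it: `find_identity_reverts` and `load_content` are hypothetical names, the input tuples are an assumed shape for stub-dump style metadata, and the digest is a plain hex SHA-1 rather than whatever encoding rev_sha1 uses in storage.

```python
import hashlib
from collections import defaultdict


def find_identity_reverts(revisions, load_content):
    """Find revisions whose content is identical to an earlier revision.

    revisions: iterable of (rev_id, page_id, size) tuples in chronological
        order (assumed shape; roughly what metadata-only dumps provide).
    load_content: hypothetical callable mapping rev_id -> bytes. This is the
        expensive step, so it is called as rarely as possible.
    Returns a list of (later_rev_id, earlier_rev_id) pairs.
    """
    # Two revisions can only be identical if they belong to the same page
    # and have the same size, so bucket on (page_id, size) first.
    buckets = defaultdict(list)
    for rev_id, page_id, size in revisions:
        buckets[(page_id, size)].append(rev_id)

    reverts = []
    for rev_ids in buckets.values():
        if len(rev_ids) < 2:
            continue  # size is unique on this page: never load the content

        first_seen = {}  # hex digest -> earliest rev_id with that content
        for rev_id in rev_ids:
            digest = hashlib.sha1(load_content(rev_id)).hexdigest()
            if digest in first_seen:
                reverts.append((rev_id, first_seen[digest]))
            else:
                first_seen[digest] = rev_id
    return reverts
```

Note that this only finds exact content matches, and it would also flag consecutive identical revisions (null edits), which a real analysis would probably want to filter out. It also still needs access to the content for every size collision, which is the part Andrew and Erik point out is hard for jobs that only see metadata.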