[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2019-11-21 Thread ArielGlenn
ArielGlenn added a comment. In T199121#5682911, @Nuria wrote: > I see this ticket is resolved but the dumps on commons have version="0.10" since from this ticket I gather that the dumps that contain those slots are version=11

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2019-11-22 Thread daniel
daniel added a comment. In T199121#5683594, @ArielGlenn wrote: > In T199121#5682911, @Nuria wrote: > >> I see this ticket is resolved but the dumps on commons have version versio

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2019-11-22 Thread ArielGlenn
ArielGlenn added a comment. In T199121#5684237, @daniel wrote: > In T199121#5683594, @ArielGlenn wrote: > >> In T199121#5682911,

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2019-11-22 Thread daniel
daniel added a comment. In T199121#5684250, @ArielGlenn wrote: > In T199121#5684237, @daniel wrote: > >> In T199121#5683594, @A

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2019-11-22 Thread ArielGlenn
ArielGlenn added a comment. In T199121#5684397, @daniel wrote: > In T199121#5684250, @ArielGlenn wrote: > >> ... >> https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/4

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-07-09 Thread daniel
daniel added a comment. Note: Not part of the MVP for SDC. TASK DETAIL: https://phabricator.wikimedia.org/T199121 EMAIL PREFERENCES: https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: daniel Cc: Aklapper, Fjalapeno, ArielGlenn, daniel, Lahi, PDrouin-WMF, Gq86, E1presidente, Ramsey-WMF,

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-07-19 Thread ArielGlenn
ArielGlenn added a comment. This is surely the wrong place to drop a link, but it can always be moved. Have some thoughts about MCR, two-pass dumps, and the XML schema: https://www.mediawiki.org/wiki/User:ArielGlenn/MCR_and_dumps

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-02 Thread ArielGlenn
ArielGlenn added a comment. Existing proposals (which were the occasion for my comments at the link above): https://www.mediawiki.org/wiki/Multi-Content_Revisions/Dumps

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-02 Thread ArielGlenn
ArielGlenn added a comment. Notes on timing: it is expected that commons MCR writes (metadata for media) will be happening by Oct 1, so it would be really really nice to have an RFC approved and code written by then. That's a pretty short time frame given that it's summer vacation right now, but oth

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-02 Thread daniel
daniel added a comment. In T199121#4437696, @ArielGlenn wrote: This is surely the wrong place to drop a link, but it can always be moved. Have some thoughts about MCR, two-pass dumps, and the XML schema: https://www.mediawiki.org/wiki/User:ArielGlenn/MCR_and_dumps It's totally the right place :)

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-06 Thread ArielGlenn
ArielGlenn added a comment. I'm adding here the tables and fields that need to be part of the dumps, both for export and for import, so everyone is on the same page. slots: slot_revision_id slot_role_id slot_content_id slot_origin content: content_id content_size content_sha1 content_model
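A rough sketch of the two tables as plain records may help when reading the rest of the thread. It uses only the field names visible in the snippet above (the list is cut off, so more fields may exist); the Python types and the helper `content_for_revision` are illustrative assumptions, not MediaWiki code.

```python
from dataclasses import dataclass


@dataclass
class SlotRow:
    # Field names as listed in the comment; integer types are assumed.
    slot_revision_id: int
    slot_role_id: int
    slot_content_id: int
    slot_origin: int


@dataclass
class ContentRow:
    # Only the fields visible in the (truncated) comment; types are assumed.
    content_id: int
    content_size: int
    content_sha1: str
    content_model: int


def content_for_revision(slots, contents, revision_id):
    """Hypothetical helper: return the content rows linked to one revision."""
    by_id = {c.content_id: c for c in contents}
    return [by_id[s.slot_content_id] for s in slots
            if s.slot_revision_id == revision_id]
```

With rows like these, two revisions can point at the same content row when a slot is unchanged, which is the case the later deduplication comments are concerned with.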

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-06 Thread ArielGlenn
ArielGlenn added a comment. Some initial comments/questions on slot_roles, content_models tables: We could dump the slot_roles and content_models tables separately if we wanted to; the mechanism for adding those is easy enough. But we might also/instead want to add them to the header which appears

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-06 Thread ArielGlenn
ArielGlenn added a comment. Making clear here the correspondence between revisions, slots, content, text, and comparing that to the previous setup with just revisions and text. Until now: page -> one or more revisions, each revision linked to exactly one entry in text table, each entry in text tab

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-06 Thread ArielGlenn
ArielGlenn added a comment. Because MCR content on Commons, and specifically the metadata storage piece, is set to go live on October 1st, and we likely will barely have an RFC out by that time if we are lucky, we will not be giving adoptees much time to convert their existing utilities to use the

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-08 Thread ArielGlenn
ArielGlenn added a comment. @daniel I am about to steal a bunch of your preliminary work and comments in order to craft a workable proposal (see the link in the task description now); for this reason you are listed as a co-author of the RFC. If you would rather not, please say so and I'll just give

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-08 Thread daniel
daniel added a comment. @ArielGlenn I'm happy to be a co-author, I just can't invest much time into this right now. Thanks for taking care of this!

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread ArielGlenn
ArielGlenn added a comment. The draft at https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps is ready for a first round of comments by people on this ticket (or people just following along). Anything from 'that schema is wro

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread daniel
daniel added a comment. I'm not sure the transitional format is useful. What client would use it, and what for? A client that wants to process the transitional format would need to implement handling for two different XML structures, and actually use both in parallel. A client that still relies on

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread daniel
daniel added a comment. Introducing a <content> tag that wraps a <text> tag and additional tags for meta-data looks nice and clean, but it's also completely incompatible with existing consumers. Just generating multiple <text> tags, and adding any new info as attributes on them, would allow existing consumer code th

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread Tgr
Tgr added a comment. Or just drop the idea of a transitional format and keep the main slot one level higher forever (or at least until the next change that really needs to be a BC break). Old clients will not break, MCR-aware clients will maybe need very slightly less elegant parsing code, which do

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread daniel
daniel added a comment. MCR-aware clients will maybe need very slightly less elegant parsing code, which does not seem like a bad trade-off. It's an option, but it feels like we'd be shifting the pain to consumers, who have to maintain that duplicate logic forever. And I don't see how it's better

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread Tgr
Tgr added a comment. In T199121#4493904, @daniel wrote: And I don't see how it's better than the multiple-text-tags-with-attributes approach. How safe is the assumption that an XML parsing library that expects a tag to be unique will just ignore extra occurrences and return the first tag (as opp

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-10 Thread daniel
daniel added a comment. SAX parsers would typically take the last and get confused True, that's a risk - they would then be using the wrong slot.
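The risk discussed in the two comments above can be illustrated with a small sketch. It assumes a revision carrying two <text> elements distinguished by a hypothetical role attribute (the attribute name and sample content are made up for illustration). A tree-based lookup returns the first match, while a naive SAX handler that keeps overwriting one field ends up with the last.

```python
import xml.etree.ElementTree as ET
import xml.sax

# Hypothetical revision with one <text> element per slot.
REVISION = """<revision>
  <text role="main">main slot text</text>
  <text role="mediainfo">extra slot text</text>
</revision>"""


def first_text_via_tree(xml_string):
    # Element.find() returns the first matching child element.
    root = ET.fromstring(xml_string)
    return root.find("text").text


class LastTextHandler(xml.sax.ContentHandler):
    """Naive handler written for one <text> per revision: later ones overwrite."""

    def __init__(self):
        super().__init__()
        self.in_text = False
        self.text = None
        self._buf = []

    def startElement(self, name, attrs):
        if name == "text":
            self.in_text = True
            self._buf = []

    def characters(self, content):
        if self.in_text:
            self._buf.append(content)

    def endElement(self, name):
        if name == "text":
            self.text = "".join(self._buf)
            self.in_text = False


def last_text_via_sax(xml_string):
    handler = LastTextHandler()
    xml.sax.parseString(xml_string.encode("utf-8"), handler)
    return handler.text
```

Under these assumptions the tree lookup sees the main slot and the SAX handler silently ends up with the other slot, which is the "wrong slot" outcome mentioned above.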

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-13 Thread brion
brion added a comment. My concern with the two-step transition idea is that some consumers may not update on a reliable schedule, or may not be able to do so easily. For instance, if people are using Special:Export on one wiki and Special:Import'ing those pages on another that's *not* a Wikimedia-h

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-13 Thread brion
brion added a comment. Ok, proposed transitional schema looks like it imports cleanly via importDump (which uses same code path as Special:Import). The proposed final schema, however, imports a revision with empty text (and throws a notice on Undefined index: text in /vagrant/mediawiki/includes/imp

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-13 Thread brion
brion added a comment. As for role ids -- perhaps we should primarily use the names, not the numbers, in the bit. It's analogous to a page's reference (a primary identifier) not to its or (which are provided informatively if you want to repro the database exactly, but can be freely discarded wh

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-13 Thread ArielGlenn
ArielGlenn added a comment. Thanks for the input so far! I'm just as happy not to expose content model ids. Relying on slot role names exclusively while hiding the ids makes me a little nervous, only because I fear those names may be changed in the future from time to time. Dumps processors would

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-14 Thread ArielGlenn
ArielGlenn added a comment. I've updated the draft RFC to remove the 'final' schema, leaving the 'transitional' schema as the new schema proposal; I've munged the 'header changes' section leaving my question about possible changes to slot role names in there for comment. Still thinking about @danie

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-14 Thread daniel
daniel added a comment. question about possible changes to slot role names Slot role names are public identifiers, they cannot change, ever. Same for model names.

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-23 Thread ArielGlenn
ArielGlenn added a comment. I've updated the draft RFC to include slot role name instead of slot role id. I'm still thinking about the multiple text element with attributes, though leaning against it. Question: what is address="tt:12345" in the alternative proposal here https://www.mediawiki.org/

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-24 Thread daniel
daniel added a comment. Question: what is address="tt:12345" in the alternative proposal here https://www.mediawiki.org/wiki/Multi-Content_Revisions/Dumps ? That's the replacement for the text id, for use with stub dumps. We now have a blob store that resolves url-style addresses. The idea is the

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-27 Thread ArielGlenn
ArielGlenn added a comment. That was very helpful, thanks. Okay, here's my take. At some point in the future (unknown when), we might lose the text table; we'd have to have someplace for third-party installations to store their revision text, and whatever that mechanism is (not necessarily an ext

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-08-27 Thread daniel
daniel added a comment. we'd have to have someplace for third-party installations to store their revision text, and whatever that mechanism is (not necessarily an external store), would be supported along with the external store. We also need to support installations that do not enable MCR. What th

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-19 Thread daniel
daniel added a comment. I'd really like to move this forward. Ideally, we'd get the new dump format into the 1.32 release. @ArielGlenn is there anything holding this back? Can we have an IRC discussion on this soon?

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-19 Thread ArielGlenn
ArielGlenn added a comment. Sorry, I've been trying to get a couple other things off my plate. Will be coming back to this shortly.

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-21 Thread daniel
daniel added a comment. I just realized that the proposed dump format is still using numeric text IDs. That cannot be guaranteed to work, text blobs are now identified by URL-like blob addresses: "tt:12345" is the address of text row 12345, and we may start using "ext:DB:..." for ExternalStore soon
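As a minimal sketch of what a dump consumer might do with such addresses: split off the scheme and only treat the remainder as a text-table row id when the scheme is "tt". The function names are hypothetical, and the "ext:DB:example" value used in the note below is a made-up placeholder, since the real ExternalStore address format is not spelled out in the comment.

```python
def parse_blob_address(address):
    """Split a URL-like blob address into (scheme, remainder)."""
    scheme, sep, rest = address.partition(":")
    if not sep or not scheme or not rest:
        raise ValueError(f"not a blob address: {address!r}")
    return scheme, rest


def text_row_id(address):
    """Return the text table row id for 'tt:' addresses, else None."""
    scheme, rest = parse_blob_address(address)
    if scheme != "tt":
        # Some other store; there is no numeric text id to recover.
        return None
    return int(rest)
```

For example, text_row_id("tt:12345") gives the integer 12345, while a placeholder such as "ext:DB:example" yields None, which is why a purely numeric id attribute cannot describe every blob.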

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-26 Thread ArielGlenn
ArielGlenn added a comment. I wish I'd had a pointer to the changes in Storage/BlobStore.php and SqlBlobStore.php earlier; I didn't realize this new form of text addressing was baked in already. Anyways, I've revised the draft accordingly, sorry for the long delay. Main slot still only gets an id num

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-26 Thread daniel
daniel added a comment. Hi Ariel! Thanks for updating. Sorry for the confusion about the address format, I thought you knew this was already in. This is what the content_address field contains. I tried to explain the idea in August, see T199121#4529353. Main slot still only gets an id number for t

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-26 Thread ArielGlenn
ArielGlenn added a comment. I wasn't happy with the different formats for the same attribute name either, using a different name for the content text id is grand! Using anyURI makes for a simpler schema too. I've added the location attr as optional for the main slot, with the intent that it gets

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-26 Thread daniel
daniel added a comment. I've added the location attr as optional for the main slot, with the intent that it gets used when we have some other schema than 'tt' in play. The next version would have the id attr as optional + deprecated for the main slot, only permitted to be written for blobs with the

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-27 Thread ArielGlenn
ArielGlenn added a comment. In T199121#4619450, @daniel wrote: ... Sounds mostly good, except that we have to bump the XML schema when we want to do T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production). If we made id optional & deprecated right away, we wouldn't

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-09-27 Thread daniel
daniel added a comment. I'm reluctant to change anything about the formatting of the main slot this round. There are tools that convert xml dumps to sql suitable for import, and the text id may be used by some of these as a convenience for constructing the text table entry. Such tools will have to

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-01 Thread ArielGlenn
ArielGlenn added a comment. This will be discussed at the TechCom meeting Wednesday, October 3rd at 2pm PST (21:00 UTC, 23:00 CET). The announcement was sent to Wikitech-l: https://lists.wikimedia.org/pipermail/wikitech-l/2018-September/090881.html I have also sent email about it to the xmldatadump

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-03 Thread kchapman
kchapman added a comment. TechCom hosted an IRC meeting on this today: minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-10-03-21.00.html log: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-10-03-21.00.log.html

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread Tgr
Tgr added a comment. Some issues that we did not have time to fully discuss during the meeting: sha1 B/C. There are two candidates for the old sha1 field: the sha1 of the main slot and the sha1 of the full revision (which is computed as taking the base36 sha1 of slot 1, concatenating the raw valu

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread daniel
daniel added a comment. Quick addendum to @Tgr's last point: in theory a lot of resources could be saved if identical slot contents are only written out once (they will be a very frequent occurrence due to reverts) Reverts are not a new problem, and not the largest problem. Inherited slots are: I

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread ArielGlenn
ArielGlenn added a comment. Following up on the deduplication issue raised above: The main concern about bloat with the new schema, as I understand it, is that it may be common for only one slot's content to change, in which case we don't want to write out the content for the other slots. This is

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread ArielGlenn
ArielGlenn added a comment. I'll make the changes agreed upon in last night's meeting to the RFC a bit later today and will note here when they are done.

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread daniel
daniel added a comment. Side note re the id attribute becoming optional: hasn't it always been (formally) optional, because it was only emitted for stubs?

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread ArielGlenn
ArielGlenn added a comment. https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps#Schema This has now been updated. In T199121#4641212, @daniel wrote: Side note re the id attribute becoming optional: hasn't it always be (for

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread daniel
daniel added a comment. Note that text location and id attrs might be omitted in the case where the text is deleted (in practice, this is what the code does), and yet id has never been marked as optional, so I am not marking location as optional either. I'd rather argue that the spec should be fix

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread ArielGlenn
ArielGlenn added a comment. (Sorry for the near-stream-of-consciousness updates here, just trying to Get **It Done.) To get the sha1 discussion started: I can't imagine any script is going to care about the revision sha1 as separate from the sha1 of the content of each slot separately; if you wan

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread daniel
daniel added a comment. I can't imagine any script is going to care about the revision sha1 as separate from the sha1 of the content of each slot separately; if you want to know if an edit was reverted, you can look at the sha1 of the individual slots, and I imagine that much analysis will focus on

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-04 Thread daniel
daniel added a comment. Re the `id` attribute being optional or not: turns out, it's optional already: The "use" attribute of the element in an XML schema is indeed optional, and its default value is "optional" :) The fact that this is declared for all other attributes but omitted for id is confusin

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread ArielGlenn
ArielGlenn added a comment. In T199121#4643036, @daniel wrote: ... Anything using the existing revision level sha1 for revert detection will misdetect a revert (or a null-edit) for *all* revisions that did not affect the main slot. While analysis on the slot level may be useful, existing an
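The misdetection described in the quoted comment can be sketched as follows. Revisions are modelled as a dict from slot role name to content; the role name "mediainfo" and the sample contents are made up, and the 31-character base-36 width is an assumption chosen to match the length of the sha1 value shown later in this thread. Comparing the full set of per-slot hashes is used here only as a stand-in for a revision-level hash; it is not MediaWiki's rev_sha1 algorithm.

```python
import hashlib

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36(text, width=31):
    """SHA-1 of the UTF-8 text, written in base 36 and left-padded with zeros."""
    n = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out.rjust(width, "0")


def slot_hashes(revision):
    """Map each slot role name to the hash of its content."""
    return {role: sha1_base36(content) for role, content in revision.items()}


def same_main_slot(a, b):
    # What a main-slot-only hash comparison would see.
    return slot_hashes(a)["main"] == slot_hashes(b)["main"]


def same_all_slots(a, b):
    # Stand-in for a hash that covers every slot of the revision.
    return slot_hashes(a) == slot_hashes(b)
```

Two revisions that differ only in a non-main slot compare as equal under same_main_slot but not under same_all_slots, so a tool relying on a main-slot-only hash would report a revert or null edit that did not happen.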

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread daniel
daniel added a comment. @Halfak having per-slot hashes is not controversial, the question is what to do with the <sha1> tag that currently exists on the revision level. If we make this the hash of the main slot, we will break the assumption that two revisions that have the same hash there have the same co

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread Halfak
Halfak added a comment. Oh. I guess my sense is to drop the existing <sha1> in favor of the <sha1>'s related to individual content slots. It doesn't make sense anymore, and we're breaking the schema anyway to add new content slots.

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread daniel
daniel added a comment. Well, we are trying to keep compatibility for most clients that just ignore stuff they don't know. They would still be able to process the dumps as before. Removing the <sha1> tag would be cleaner and safer, but it would also be a hard B/C break. So, IF we keep it, should we keep

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread FaFlo
FaFlo added a comment. Of course checksums make a lot of sense for countless use cases, including many in research (mentioned paper was never intended to make a sweeping point to the contrary, but yes, discussion for another time). And I think MCR is awesome, JFTR. Regarding the question at ha

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread daniel
daniel added a comment. Ok, so if we use the tag on the revision level for the main tag hash, where do we put the revision hash? Or do we just ignore it? Note that the revision hash is in the database, and is exposed via the API. It's not made up for the purpose of the dumps.

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread Tgr
Tgr added a comment. Just add a tag?

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread Halfak
Halfak added a comment. where do we put the revision hash? What is the "revision hash"?

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-09 Thread daniel
daniel added a comment. What is the "revision hash"? A combined hash that identifies the content of the slot across all revisions. It's stored in rev_sha1 in the database, and used by stuff on labs to detect manual reverts. It's also available from the API as rvprop=sha1.

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-10 Thread ArielGlenn
ArielGlenn added a comment. In T199121#4643642, @daniel wrote: Re the `id` attribute being optional or not: turns out, it's optional already: The "use" attribute of the element in an XML schema is indeed optional, and its default value is "optional" :) The fact that this is declared for all other a

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-10 Thread Halfak
Halfak added a comment. re. the "revision hash", it seems that this has already been determined so I'm not sure what other insights I might give. But FWIW, the combined rev_sha1 seems very crazy :) If there is a combined rev_sha1 that is built by any strategy (crazy or not) from the database, then

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-10 Thread Tgr
Tgr added a comment. @Halfak the question is: given that XML readers are somewhat flexible, so a script that was written some time ago and has no knowledge of MCR will still be able to read MCR dumps and see all the fields it expects, and it will quietly ignore non-main slots as unknown fields, wh

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-12 Thread mako
mako added a comment. I think it would be more surprising to have the same SHA1 for two different (in any way) revisions than it would to have a SHA1 that reflects something other than the SHA1 of the primary/historical content field. So I guess I like the revision-level reflecting the hash of al

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-15 Thread daniel
daniel added a comment. @Halfak to clarify - you originally said the top level tag should be the hash of the main slot's content, but you now let yourself be convinced that it's more useful to have the combined revision hash there?

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-15 Thread daniel
daniel added a comment. As the above conversation seems to converge on having the combined hash in the revision level tag, this raises the question where to put the main slot's content hash. I personally prefer to use an attribute on the tag, but we could also go for or .TASK DETAILhttps://phabr

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-16 Thread Halfak
Halfak added a comment. For clarity, I was originally advocating that we didn't combine any hashes and that instead we provided a tag in each of the slots. I now see that we're going to make a mess in favor of backwards compatibility. So there will continue to be a tag and at the top of the

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-18 Thread ArielGlenn
ArielGlenn added a comment. OK, it looks like everyone's weighed in, so I'll suggest: 308722154 ... text/x-wiki <-- contains sha1 of content in main slot a9kdtqq3buy5tribez2u0ad4b6fdxq2 <-- revision sha1 I'm not excited about the attribute but I like it better than an

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-18 Thread daniel
daniel added a comment. Looks good to me!

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-19 Thread kchapman
kchapman added a comment. TechCom has moved Last Call to end Wednesday 31 October 2pm PST (21:00 UTC, 22:00 CET). This is due to no TechCom meeting next week to vote on final approval.

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-22 Thread ArielGlenn
ArielGlenn added a comment. I've added the Sha1 as an attribute of the Text element in all slots; this means removing it as an element from the Content element. I think the new markup reflects these changes correctly, please double-check though. As to 'optional' or 'required', I'd like to remove a

[Wikidata-bugs] [Maniphest] [Commented On] T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps

2018-10-29 Thread ArielGlenn
ArielGlenn added a comment. I've updated the RFC proposal removing all 'use=optional' markup as described above. This should probably get brought up during last call at this week's TechCom to make sure everyone's ok with it.