[Bug 54328] Make it possible for edit diff to be provided as a raw text
https://bugzilla.wikimedia.org/show_bug.cgi?id=54328

--- Comment #3 from Aaron Halfaker aaron.halfa...@gmail.com ---
1. Character vs. line offset

I'd much rather represent diffs based on a character offset. I'm afraid of representing position with something like lineno, since linebreaks are defined differently between systems. Character offsets would also allow us to change our diff detection strategy without changing the output.

2. Machine readable vs. human readable diffs

Machine readable diff opcode formats tend to represent the full set of operations used to recreate a revision -- not just the context. A common format that I'm familiar with would be something like this:

    a = "These are wrd."
    b = "These are words."

    {
        "diff": [
            { "op": "equal",  "a_start": 0,  "a_end": 10, "b_start": 0,  "b_end": 10 },
            { "op": "remove", "a_start": 10, "a_end": 13, "b_start": 10, "b_end": 10, "content": "wrd" },
            { "op": "insert", "a_start": 13, "a_end": 13, "b_start": 10, "b_end": 15, "content": "words" },
            { "op": "equal",  "a_start": 13, "a_end": 14, "b_start": 15, "b_end": 16 }
        ]
    }

3. Compressed format

I don't see the value in compressing the format, given that the API doesn't really let you query for more than one diff at a time and diffs tend to be represented in few operations. However, we could simply represent each operation as a tuple with an agreed-upon field order:

    { "op": "insert", "a_start": 13, "a_end": 13, "b_start": 15, "b_end": 18, "content": "foo" }

could be

    [ "insert", 13, 13, 15, 18, "foo" ]

or, if we really want a tight format (since the rest of the fields are derivable in a sequence of operations):

    [ "insert", 15, "foo" ]

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
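The opcode format proposed in comment #3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library's difflib.SequenceMatcher as a stand-in for whatever diff engine the API would actually use, and the helper names char_opcodes, apply_ops and to_tuple are hypothetical. difflib matches individual characters, so its operations for the "wrd"/"words" example may be split differently from the hand-written ones in the comment, but the offsets follow the same convention.

```python
from difflib import SequenceMatcher


def char_opcodes(a, b):
    """Return character-offset operations that turn string a into string b.

    Hypothetical helper: field names follow comment #3, the diff engine
    (difflib) is an assumption made for this sketch.
    """
    ops = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append({"op": "equal", "a_start": a1, "a_end": a2,
                        "b_start": b1, "b_end": b2})
        # difflib's "replace" is emitted as a remove followed by an insert.
        if tag in ("delete", "replace"):
            ops.append({"op": "remove", "a_start": a1, "a_end": a2,
                        "b_start": b1, "b_end": b1, "content": a[a1:a2]})
        if tag in ("insert", "replace"):
            ops.append({"op": "insert", "a_start": a2, "a_end": a2,
                        "b_start": b1, "b_end": b2, "content": b[b1:b2]})
    return ops


def apply_ops(a, ops):
    """Rebuild the new revision from the old text plus the full op list."""
    out = []
    for op in ops:
        if op["op"] == "equal":
            out.append(a[op["a_start"]:op["a_end"]])
        elif op["op"] == "insert":
            out.append(op["content"])
        # "remove" contributes nothing to the new text.
    return "".join(out)


def to_tuple(op):
    """Tuple form with an agreed-upon field order, as in point 3."""
    row = [op["op"], op["a_start"], op["a_end"], op["b_start"], op["b_end"]]
    if "content" in op:
        row.append(op["content"])
    return row


a = "These are wrd."
b = "These are words."
ops = char_opcodes(a, b)
rebuilt = apply_ops(a, ops)
```

Because the list contains the equal spans as well as the changes, apply_ops can recreate the new revision from the old one, which is the property comment #3 describes for machine-readable formats.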
[Bug 54328] Make it possible for edit diff to be provided as a raw text
https://bugzilla.wikimedia.org/show_bug.cgi?id=54328

--- Comment #4 from Brad Jorsch bjor...@wikimedia.org ---
(In reply to Aaron Halfaker from comment #3)
> 1. Character vs. line offset
>
> I'd much rather represent diffs based on a character offset. I'm afraid of
> representing position with something like lineno, since linebreaks are
> defined differently between systems.

Isn't that an argument for line-based rather than character-based offsets?

> Character offsets would also allow us to change our diff detection strategy
> without changing the output.
>
> 2. Machine readable vs. human readable diffs
>
> Machine readable diff opcode formats tend to represent the full set of
> operations used to recreate a revision -- not just the context.

OTOH, what is the usual use of querying the diffs? I suspect the client more often wants to display a human-readable diff to the end user than to do the equivalent of the 'patch' utility on an already-downloaded local copy of the article.

> and diffs tend to be represented in few operations.

On talk pages, maybe. But someone heavily copyediting an article is likely to generate a huge number of operations. With the way the diff algorithm works, even some simple edits will generate many operations as it tries to match up individual letters in the old vs. new paragraphs.
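The disagreement over line-break conventions can be made concrete with a small Python sketch. It is only an illustration: the sample strings and the helpers char_offset and line_number are made up for this note, and nothing here says how MediaWiki itself stores or normalizes line endings.

```python
def char_offset(text, needle):
    """Character offset of the first occurrence of needle in text."""
    return text.index(needle)


def line_number(text, needle):
    """1-based number of the first line containing needle."""
    # str.splitlines() treats \n, \r\n and \r all as line boundaries.
    for number, line in enumerate(text.splitlines(), start=1):
        if needle in line:
            return number
    raise ValueError("needle not found")


# The same three lines with Unix and with DOS line endings (made-up sample).
unix_text = "one\ntwo\nthree\n"
dos_text = "one\r\ntwo\r\nthree\r\n"

unix_pos = (char_offset(unix_text, "three"), line_number(unix_text, "three"))
dos_pos = (char_offset(dos_text, "three"), line_number(dos_text, "three"))
```

Here the character offset of "three" differs between the two copies (8 versus 10) while the line number is 3 in both, which is the point behind the question in comment #4. Character offsets stay comparable only if both sides agree on one line-ending convention before diffing.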
[Bug 54328] Make it possible for edit diff to be provided as a raw text
https://bugzilla.wikimedia.org/show_bug.cgi?id=54328

Peter Bena benap...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |55793

--- Comment #2 from Peter Bena benap...@gmail.com ---
I think that, to begin with, splitting the new text from the old text would be enough. Right now it's hard to find out what was added by the user and what was there before they edited the page.
[Bug 54328] Make it possible for edit diff to be provided as a raw text
https://bugzilla.wikimedia.org/show_bug.cgi?id=54328

Peter Bena benap...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|Unprioritized               |Normal
           Severity|normal                      |enhancement
[Bug 54328] Make it possible for edit diff to be provided as a raw text
https://bugzilla.wikimedia.org/show_bug.cgi?id=54328

--- Comment #1 from Brad Jorsch bjor...@wikimedia.org ---
The data structure would have to be rather more complicated than that. At first guess, something along the lines of (in JSON):

    "diff": [
        { "line": 1, "type": "context", "content": "Line" },
        { "line": 2, "type": "removed", "old": "Line" },
        { "line": 2, "type": "added", "new": "Line" },
        { "line": 3, "type": "context", "content": "Line" },
        { "line": 47, "type": "context", "content": "Line" },
        { "line": 48, "type": "changed", "old": "Line", "new": "Line" },
        { "line": 49, "type": "context", "content": "Line" }
    ]

If you want an indication in the line of what changed for "changed" types, that's another complication. Instead of just "Line" it would have to be an array of fragments. One simple way might be that even array indexes are unchanged and odd ones are changed:

    "old": [ "foo bar ", "", "quux ", "poop", "" ],
    "new": [ "foo bar ", "baz ", "quux ", "etc.", "" ]

That might indicate that "baz " was inserted into the list and "poop" at the end was replaced with "etc.". Or maybe it would be better to combine old and new into one data structure somehow.

Also, keep in mind that lots of little objects can use a surprising amount of memory (see bug 53663).
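The even/odd fragment convention from comment #1 can be sketched as follows. This is a minimal illustration, assuming Python's difflib.SequenceMatcher as the intra-line diff engine; the function name fragments is hypothetical and not part of any MediaWiki API.

```python
from difflib import SequenceMatcher


def fragments(old, new):
    """Split old and new into parallel fragment lists.

    Even indexes hold unchanged text (identical in both lists), odd
    indexes hold changed text. Empty strings pad the lists so that the
    parity rule always holds and both lists end on an even index.
    """
    old_parts, new_parts = [], []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        wants_even_slot = (tag == "equal")
        next_slot_is_even = (len(old_parts) % 2 == 0)
        if wants_even_slot != next_slot_is_even:
            # Insert an empty fragment to keep unchanged text on even indexes.
            old_parts.append("")
            new_parts.append("")
        old_parts.append(old[a1:a2])
        new_parts.append(new[b1:b2])
    if len(old_parts) % 2 == 0:
        # Close with an (empty) unchanged fragment, as in the example above.
        old_parts.append("")
        new_parts.append("")
    return old_parts, new_parts


old_parts, new_parts = fragments("foo bar quux poop", "foo bar baz quux etc.")
```

Joining either list gives back the corresponding full line, so a client that only wants the plain text can ignore the fragment boundaries. A single combined structure, as the comment suggests, could be derived from the same loop by emitting one entry per slot instead of two parallel lists.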