[Bug 27618] Backup dumps could contain a title index
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

Ángel González changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |keis...@gmail.com

--- Comment #9 from Ángel González 2011-11-14 15:46:09 UTC ---
I used bzip2 boundary + title hash.

If your index is 315 MB, then even if you drop the ability to perform random search, it will hardly be efficient on a consumer PC with maybe just 512 MB of RAM.

--
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
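As an illustration of the "bzip2 boundary + title hash" idea in comment #9, here is a minimal Python sketch. The comment does not say which hash or record layout was used, so everything concrete here is an assumption: the 8-byte MD5 prefix, the fixed 16-byte record, and the names title_hash, build_index and lookup are all hypothetical. The point it tries to show is that fixed-width records sorted by hash can be binary-searched directly on disk, so the index never has to fit in RAM.

```python
import hashlib
import io
import struct

# Hypothetical record layout: 8-byte title hash, 8-byte offset of the bz2 block.
RECORD = struct.Struct(">QQ")


def title_hash(title):
    """Assumed hash: first 8 bytes of the MD5 of the UTF-8 title."""
    digest = hashlib.md5(title.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_index(entries):
    """Pack (title, block_offset) pairs into records sorted by hash."""
    records = sorted((title_hash(title), offset) for title, offset in entries)
    return b"".join(RECORD.pack(h, offset) for h, offset in records)


def lookup(index_file, title):
    """Binary-search a seekable index file; return a block offset or None."""
    target = title_hash(title)
    index_file.seek(0, io.SEEK_END)
    lo, hi = 0, index_file.tell() // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        index_file.seek(mid * RECORD.size)
        h, _ = RECORD.unpack(index_file.read(RECORD.size))
        if h < target:
            lo = mid + 1
        else:
            hi = mid
    # lo now points at the first record whose hash is >= target, if any.
    index_file.seek(lo * RECORD.size)
    raw = index_file.read(RECORD.size)
    if len(raw) == RECORD.size:
        h, offset = RECORD.unpack(raw)
        if h == target:
            return offset
    return None
```

Because only hashes are stored, a reader would still have to decompress the block at the returned offset and confirm that the title really is there; two titles can share a hash.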
[Bug 27618] Backup dumps could contain a title index
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

--- Comment #8 from Ariel T. Glenn 2011-11-14 15:23:35 UTC ---
Out of curiosity, what do the various bz2 offline readers need: byte, or byte and bit, or bzip2 boundary and offset?

I expect the offline readers don't really use namespace or page ids for anything, so adding the full page title (i.e. namespace:title) should suffice. If we're talking only about things in the main article space then it doesn't matter at all (but what about images?)...
[Bug 27618] Backup dumps could contain a title index
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

Diederik van Liere changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dvanli...@gmail.com

--- Comment #7 from Diederik van Liere 2011-11-13 01:14:58 UTC ---
I like this idea and I think two things need to be added to this patch:

1) Currently only the title is written to the index file, but the entry should also include the namespace, or use the page_id instead of the title.
2) As Ariel mentioned, we are generating the dumps in multiple parts, so the index file should also keep track of which file the article can be found in.

Best,
Diederik
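To make the two requests in comment #7 concrete, here is a rough Python sketch of an index entry that carries the dump part, the offset, the page_id and the namespace-prefixed title. The tab-separated layout, the field names and the example file name are all hypothetical and are not taken from the patch. It also assumes titles never contain a tab character; a colon would be a poor separator because namespace-prefixed titles contain colons themselves.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    part_file: str   # which dump part holds the page (hypothetical field)
    offset: int      # offset of the page within that part
    page_id: int
    full_title: str  # namespace-prefixed title, e.g. "Talk:Main Page"


def format_entry(entry):
    """Serialize one entry as a tab-separated line (assumed format)."""
    return "\t".join(
        [entry.part_file, str(entry.offset), str(entry.page_id), entry.full_title]
    )


def parse_entry(line):
    """Parse a line written by format_entry back into an IndexEntry."""
    part_file, offset, page_id, full_title = line.rstrip("\n").split("\t", 3)
    return IndexEntry(part_file, int(offset), int(page_id), full_title)
```

Putting the title last and splitting at most three times keeps any unusual characters in the title from disturbing the other fields.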
[Bug 27618] Backup dumps could contain a title index
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

Sumana Harihareswara changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |need-review
                 CC|                            |suma...@panix.com

--- Comment #6 from Sumana Harihareswara 2011-11-10 00:16:38 UTC ---
Adding the need-review keyword because my impression is that Adam wanted other developers to check his approach and give feedback. Thanks for the patch, Adam!
[Bug 27618] Backup dumps could contain a title index
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

--- Comment #4 from Adam Wight 2011-03-18 06:45:44 UTC ---
Created attachment 8310
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=8310
ROUGH

Not much to show yet, but in case someone wants to lend a hand...

My intention is that:
* each backup job records the arguments with which it was invoked
* an index entry is recorded for each page, giving its offset into the compressed data being generated

Problems:
1) there is no convention for saving to a second file stream (the index file)
2) bz2 php library does not expose the libbz2.so "tell" function, nor could that function work without flushing buffers.

Perhaps the recorded offset can be addressed by bz2 chunk, then by uncompressed offset.
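The "bz2 chunk, then uncompressed offset" addressing suggested at the end of comment #4 can be sketched as follows. This is not the attached patch (which is PHP); it is a Python illustration under one loudly stated assumption: each chunk is written as its own complete bz2 stream, so a chunk always starts on a byte boundary and its position is known without any "tell" support in the compression library. The function names and the pages_per_chunk parameter are hypothetical.

```python
import bz2
import io


def write_indexed_dump(pages, out, pages_per_chunk=2):
    """Write pages as concatenated bz2 streams and return the index.

    Each index entry is (chunk_offset, uncompressed_offset, title):
    the byte offset of the bz2 stream in the output, then the page's
    offset inside that stream's uncompressed data.
    """
    index = []
    for start in range(0, len(pages), pages_per_chunk):
        chunk_offset = out.tell()  # position of the output file, not of bz2
        raw = bytearray()
        for title, text in pages[start:start + pages_per_chunk]:
            index.append((chunk_offset, len(raw), title))
            raw += text.encode("utf-8")
        out.write(bz2.compress(bytes(raw)))
    return index


def read_chunk(src, chunk_offset):
    """Decompress exactly one bz2 stream starting at chunk_offset."""
    src.seek(chunk_offset)
    decomp = bz2.BZ2Decompressor()
    data = bytearray()
    while not decomp.eof:
        block = src.read(4096)
        if not block:
            break  # truncated input; return what was decoded so far
        data += decomp.decompress(block)
    return bytes(data)
```

A reader would look up the entry for a title, call read_chunk with the first number, and slice the result starting at the second. The trade-off is compression ratio: smaller chunks give faster random access but compress less well.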
[Bug 27618] Backup dumps could contain a title index
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

--- Comment #3 from Adam Wight 2011-02-23 00:16:44 UTC ---
Interesting. Also, the byte offsets are into the compressed data of course, ftell(STDOUT), and the boundaries between bz2 chunks also become very relevant.

Thanks, I'll have a patch for review this week!
[Bug 27618] Backup dumps could contain a title index
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

--- Comment #2 from Ariel T. Glenn 2011-02-23 00:12:29 UTC ---
How will this work for runs that do parts in parallel? I still don't know if those pieces should be recombined later but at present we are running on the assumption that they should be. Not a big issue, it's just that you'll need to write a little script to recalculate the byte offsets for the combined dump when that phase runs, keeping track of the bit alignment to get the page start byte in later pieces right.

This would be handy for a number of things actually, so I'd like to see it happen.
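The "little script" mentioned in comment #2 might look roughly like this in the simplest case. The sketch assumes the parts are recombined by plain concatenation of complete files, so each part starts on a byte boundary and only a running byte total is needed. It does not handle the harder case the comment alludes to, where pieces are merged into a single bz2 stream and bit alignment has to be tracked. The function name and the (offset, title) entry shape are hypothetical.

```python
def rebase_index(part_sizes, part_indexes):
    """Shift per-part byte offsets to offsets in the concatenated dump.

    part_sizes:   size in bytes of each part, in concatenation order
    part_indexes: for each part, a list of (offset, title) entries
    """
    if len(part_sizes) != len(part_indexes):
        raise ValueError("need exactly one index per part")
    combined = []
    base = 0  # total size of all parts placed before the current one
    for size, entries in zip(part_sizes, part_indexes):
        for offset, title in entries:
            if not 0 <= offset < size:
                raise ValueError(f"offset {offset} outside part of size {size}")
            combined.append((base + offset, title))
        base += size
    return combined
```

For example, with parts of 100 and 250 bytes, an entry at offset 120 in the second part ends up at offset 220 in the combined file.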
[Bug 27618] Backup dumps could contain a title index
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

Mark A. Hershberger changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|Normal                      |High
                 CC|                            |ar...@wikimedia.org,
                   |                            |m...@everybody.org
         AssignedTo|wikibugs-l@lists.wikimedia. |s...@ludd.net
                   |org                         |

--- Comment #1 from Mark A. Hershberger 2011-02-22 17:59:03 UTC ---
(In reply to comment #0)
> The simplest remedy would be to register a dump filter which creates a text
> file mapping article title -> byte offset. If this is done during the backup
> process, there is almost no resource overhead.
>
> I can write a patch if other developers agree this would be a worthwhile
> pursuit.

I'm interested. CCing Ariel for input and assigning to you. Let's have a patch!