[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Dimitris Kontokostas changed:
  CC: added jimk...@gmail.com

--- Comment #8 from Dimitris Kontokostas ---
Without looking at the code, we had a few comments on this extension that also relate to performance [1] [2]. The basic idea is whether the client (a wiki mirror) can delay the poll (e.g. by 5-10 minutes) and aggregate multiple edits of the same page into one. OAI had native support for this. The reason is that in DBpedia Live we ran into serious scaling issues whenever the database was under constant update (even with edits aggregated over a 5-10 minute poll delay) while using the MW API in parallel to fetch data for extraction. I don't know whether the PubSubHubbub spec supports this, but the client could do some aggregation manually (see the sketch below).

We also have a question about hub data persistence, but we are not sure which thread is the more appropriate place for it.

[1] https://lists.wikimedia.org/pipermail/wikidata-tech/2014-July/000533.html
[2] https://lists.wikimedia.org/pipermail/wikidata-tech/2014-July/000539.html
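For illustration only, a minimal sketch of the client-side aggregation idea above: a hypothetical subscriber buffers incoming page-change notifications and flushes them in one batch after a configurable delay, so several edits to the same page collapse into a single fetch. This is plain PHP and not part of the extension; the class name and window length are placeholders.

<?php
// Hypothetical client-side aggregator for a wiki mirror receiving
// PubSubHubbub notifications.
class ChangeAggregator {
	private $window;            // aggregation window in seconds (e.g. 300-600)
	private $pending = array(); // page title => latest notification time
	private $firstSeen = null;  // when the oldest pending change arrived

	public function __construct( $windowSeconds = 300 ) {
		$this->window = $windowSeconds;
	}

	public function notify( $title ) {
		if ( $this->firstSeen === null ) {
			$this->firstSeen = time();
		}
		// Later notifications for the same title overwrite the entry,
		// so N edits to one page collapse into a single pending item.
		$this->pending[$title] = time();
	}

	/** Returns the deduplicated titles once the window has elapsed. */
	public function flushIfDue() {
		if ( $this->firstSeen === null || time() - $this->firstSeen < $this->window ) {
			return array(); // still aggregating
		}
		$titles = array_keys( $this->pending );
		$this->pending = array();
		$this->firstSeen = null;
		return $titles; // caller fetches these via the MediaWiki API in one pass
	}
}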
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Aaron Schulz changed:
  Assignee: aschulz4...@gmail.com -> wikibugs-l@lists.wikimedia.org
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #7 from Daniel Kinzler ---
The updates use the XML dump format. That's not really compatible with the way the API structures XML output, and if JSON is requested from the API, the dump would have to be encoded as a single string value. That's really annoying. Also, API modules generally shouldn't mess with HTTP headers, which is necessary for PuSH as far as I understand.

A special page seems to fit the bill much better on the technical level, though it's a bit awkward conceptually. There are several special pages that generate non-HTML output, though. Hm, perhaps the best way to implement this would be a custom action?
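As a rough sketch of the kind of special page mentioned above, one that disables the normal skin output and emits raw XML with its own HTTP headers: the class name, page name, hub URL, and payload below are hypothetical placeholders, assuming the MediaWiki 1.23-era SpecialPage/WebResponse API rather than the extension's actual code.

<?php
// Hypothetical example of a special page producing non-HTML output.
class SpecialPubSubHubbubExport extends SpecialPage {
	public function __construct() {
		parent::__construct( 'PubSubHubbubExport' );
	}

	public function execute( $subPage ) {
		// Suppress the normal skin/HTML rendering entirely.
		$this->getOutput()->disable();

		$response = $this->getRequest()->response();
		$response->header( 'Content-Type: application/xml; charset=utf-8' );
		// PuSH discovery uses Link headers pointing at the hub (placeholder URL).
		$response->header( 'Link: <https://hub.example.org/>; rel="hub"' );

		// Placeholder payload; the real extension emits the XML dump format.
		echo "<?xml version=\"1.0\"?>\n<mediawiki><!-- dump-format payload --></mediawiki>";
	}
}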
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #6 from Brad Jorsch ---
I tested it before posting comment 4, just to be sure: it does work. You can now test it yourself. Go to http://tools.wmflabs.org/oauth-hello-world/index.php?action=authorize, then go to http://tools.wmflabs.org/oauth-hello-world/index.php?action=testspecial, which will dump the response received when hitting Special:MyPage.
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #5 from Aaron Schulz ---
I don't think it works on special pages. I asked Chris about that a few days back.
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #4 from Brad Jorsch ---
(In reply to Aaron Schulz from comment #3)
> It's not a huge deal, but it would be nice to use the API (which also works with things like OAuth if we wanted hubs with more access). Brad might have an opinion on special page vs API. I don't feel strongly, but we should get it right given the Link header and cache interaction.

OAuth works with special pages too.

It sounds like the general idea is that the extension could be configured for general users to use? In that case an API endpoint wouldn't necessarily be out of place. On the other hand, is there anything wrong with the existing special page implementation? If not, is there a reason not to just stay with it?
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Aaron Schulz changed:
  CC: added bjor...@wikimedia.org

--- Comment #3 from Aaron Schulz ---
* The delay could in theory be based on the current max slave lag... but a simple approach would be to just pick a value. 10 seconds would be very safe under non-broken circumstances.
* It's not a huge deal, but it would be nice to use the API (which also works with things like OAuth if we wanted hubs with more access). Brad might have an opinion on special page vs API. I don't feel strongly, but we should get it right given the Link header and cache interaction.
* You could still use the recent changes table to reduce the number of jobs. There could be at most one de-duplicated job per hub, which would grab all title changes since the last run and send them to the hub. When the job succeeds (no HTTP errors posting the changed URIs), it could bump the time. The stored time could live in a simple DB table. The job could be delayed to give it time to cover a larger range. The range would be "last time" to present (or a smaller range to limit the number of items, especially for the new-hub case). Maybe 5 minutes. It could hopefully batch more titles into one or a few HTTP requests (or use pipelining, or at least some curl_multi). A rough sketch of such a job follows below.
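A rough sketch of the single de-duplicated per-hub job described above. This is not the extension's code: the class name, the push_hub_position table and its columns, and postToHub() are hypothetical, and it assumes the MediaWiki 1.23-era Job and Database APIs.

<?php
// Hypothetical batch job: one job per hub, pushing all recentchanges
// entries since the hub's stored position, then advancing the position.
class PuSHBatchNotifyJob extends Job {
	public function __construct( Title $title, array $params ) {
		// $params: 'hub' => hub callback URL
		parent::__construct( 'pushBatchNotify', $title, $params );
	}

	public function ignoreDuplicates() {
		return true; // at most one queued job per hub
	}

	public function run() {
		$dbw = wfGetDB( DB_MASTER );
		// Hypothetical table storing the last timestamp pushed per hub.
		$since = $dbw->selectField( 'push_hub_position', 'php_timestamp',
			array( 'php_hub' => $this->params['hub'] ), __METHOD__ );

		$now = wfTimestampNow();
		$res = $dbw->select( 'recentchanges',
			array( 'rc_namespace', 'rc_title' ),
			array(
				'rc_timestamp > ' . $dbw->addQuotes( $since ?: '19700101000000' ),
				'rc_timestamp <= ' . $dbw->addQuotes( $now ),
			),
			__METHOD__,
			array( 'DISTINCT', 'LIMIT' => 500 ) // cap the batch size
		);

		$urls = array();
		foreach ( $res as $row ) {
			$title = Title::makeTitle( $row->rc_namespace, $row->rc_title );
			$urls[] = $title->getFullURL();
		}

		// postToHub() stands in for the extension's HTTP code; only advance
		// the position if every URL was delivered without an HTTP error.
		if ( $urls && !$this->postToHub( $this->params['hub'], $urls ) ) {
			return false; // the job queue retries, up to maxTries
		}

		$dbw->replace( 'push_hub_position', array( 'php_hub' ),
			array( 'php_hub' => $this->params['hub'], 'php_timestamp' => $now ),
			__METHOD__ );
		return true;
	}

	private function postToHub( $hub, array $urls ) {
		/* batch HTTP POST, e.g. via curl_multi; omitted here */
		return true;
	}
}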
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #2 from Alexander Lehmann ---
Hi Aaron. Thanks for your notes. Some questions about them are inline.

(In reply to Aaron Schulz from comment #1)
> A few things:
> * This seems to be missing an ArticleRevisionVisibilitySet handler
> * http_post() needs to handle $wgHTTPProxy
> * The maximum job attempts for the queue will have to be set very high to avoid update losses (maxTries)
> * NewRevisionFromEditComplete and other hooks trigger before COMMIT, so the jobs should probably be delayed (using 'jobReleaseTimestamp'). Moving them post-COMMIT is not an option since a network partition could cause nothing to be enqueued (the reverse, a job and no COMMIT, is wasteful but harmless).

I'm not sure I understood the problem correctly. How should I set the length of the delay? Should the delay adjust dynamically, or should we set a fixed period of time? (See the sketch below.)

> * We tend to run lots of jobs for one wiki at a time. http_post() could benefit from some sort of singleton on the curl handle instead of closing it each time. See http://stackoverflow.com/questions/972925/persistent-keepalive-http-with-the-php-curl-library.
> * createLastChangesOutput() should use LIMIT+DISTINCT instead of a "break" statement. Also, I'm not sure how well that works. There can only be one job in the queue for hitting the URL that returns this result, but it only covers the last 60 seconds of changes. Also, it selects rc_timestamp but does not use it now. Is it OK if the hub misses a bunch of changes from this (e.g. are the per-title jobs good enough)?
> * It's curious that the hub is supposed to talk back to a special page; why not an API page instead?

The extension provides data in the MediaWiki XML export format; I don't know if the API is intended for that format. Does it make a real difference compared to using a special page?

> * The Link headers also go there. What is the use of these? Also, since they'd take 30 days to apply to all pages (the varnish cache TTL), it would be a pain to change them. They definitely need to be stable.

They need to be there according to the PubSubHubbub specification [1]. But they are stable.

> Come to think of it, it seems like we need to send the following events to the hub:
> * New revisions
> * Page (un)deletions
> * Revision (un)deletions
> * Page moves
> All of the above leave either edit or log entries in recent changes. Page moves only leave one at the old title... though rc_params can be inspected to get the new title. I wonder if, instead of a job per title, there could be a single job that sends all changes since the "last update time" and updates the "last update time" on success. The advantages would be:
> a) Far fewer jobs needed
> b) All updates would be batched
> c) Supporting more hubs is easier since only another job and time position is needed (rather than N jobs for each hub for each title)
> Of course I may have missed something.

Unfortunately, there is no way to identify which hub made a request; theoretically, anyone can call the PubSubHubbub export interface. Therefore we do not know the "last update time". The hub cannot provide the time either, because the resource is identified only by its exact URL.
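On the delay question, a minimal sketch of the fixed-delay option using the 'jobReleaseTimestamp' parameter mentioned in comment #1. The hook handler class, the 10-second constant, the hub URL, and the PuSHBatchNotifyJob class are hypothetical, and this assumes the MediaWiki 1.23-era JobQueueGroup API with a queue backend that supports delayed jobs.

<?php
// Hypothetical hook handler enqueueing the notification job with a fixed
// delay so it runs only after the surrounding transaction has committed.
// Registration (in the extension setup file) would look roughly like:
// $wgHooks['NewRevisionFromEditComplete'][] = 'PuSHHooks::onNewRevisionFromEditComplete';
class PuSHHooks {
	const JOB_DELAY = 10; // seconds; "very safe under non-broken circumstances"

	public static function onNewRevisionFromEditComplete( $article, $rev, $baseID, $user ) {
		$job = new PuSHBatchNotifyJob( $article->getTitle(), array(
			'hub' => 'https://hub.example.org/', // placeholder hub URL
			// Release the job no earlier than now + delay.
			'jobReleaseTimestamp' => wfTimestamp( TS_UNIX ) + self::JOB_DELAY,
		) );
		JobQueueGroup::singleton()->push( $job );
		return true;
	}
}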
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Andre Klapper changed:
  CC: added alexander.lehmann@student.hpi.uni-potsdam.de, sebastian.brueckner@student.hpi.uni-potsdam.de
  Component: WikidataRepo -> PubSubHubbub
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #1 from Aaron Schulz ---
A few things:
* This seems to be missing an ArticleRevisionVisibilitySet handler
* http_post() needs to handle $wgHTTPProxy
* The maximum job attempts for the queue will have to be set very high to avoid update losses (maxTries)
* NewRevisionFromEditComplete and other hooks trigger before COMMIT, so the jobs should probably be delayed (using 'jobReleaseTimestamp'). Moving them post-COMMIT is not an option since a network partition could cause nothing to be enqueued (the reverse, a job and no COMMIT, is wasteful but harmless).
* We tend to run lots of jobs for one wiki at a time. http_post() could benefit from some sort of singleton on the curl handle instead of closing it each time (see the sketch after this comment). See http://stackoverflow.com/questions/972925/persistent-keepalive-http-with-the-php-curl-library.
* createLastChangesOutput() should use LIMIT+DISTINCT instead of a "break" statement. Also, I'm not sure how well that works. There can only be one job in the queue for hitting the URL that returns this result, but it only covers the last 60 seconds of changes. Also, it selects rc_timestamp but does not use it now. Is it OK if the hub misses a bunch of changes from this (e.g. are the per-title jobs good enough)?
* It's curious that the hub is supposed to talk back to a special page; why not an API page instead?
* The Link headers also go there. What is the use of these? Also, since they'd take 30 days to apply to all pages (the varnish cache TTL), it would be a pain to change them. They definitely need to be stable.

Come to think of it, it seems like we need to send the following events to the hub:
* New revisions
* Page (un)deletions
* Revision (un)deletions
* Page moves

All of the above leave either edit or log entries in recent changes. Page moves only leave one at the old title... though rc_params can be inspected to get the new title. I wonder if, instead of a job per title, there could be a single job that sends all changes since the "last update time" and updates the "last update time" on success. The advantages would be:
a) Far fewer jobs needed
b) All updates would be batched
c) Supporting more hubs is easier since only another job and time position is needed (rather than N jobs for each hub for each title)

Of course I may have missed something.
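A rough sketch of both http_post() points above: keeping a single curl handle alive across posts so connections are reused, and routing requests through $wgHTTPProxy when it is set. The class and function names are hypothetical stand-ins, not the extension's actual code; the curl calls are plain PHP.

<?php
// Hypothetical replacement for http_post(): reuse one curl handle for the
// lifetime of the job runner (keep-alive) and honour $wgHTTPProxy.
class PuSHHttpClient {
	private static $handle = null;

	private static function getHandle() {
		if ( self::$handle === null ) {
			self::$handle = curl_init();
			curl_setopt( self::$handle, CURLOPT_RETURNTRANSFER, true );
			curl_setopt( self::$handle, CURLOPT_CONNECTTIMEOUT, 5 );
			curl_setopt( self::$handle, CURLOPT_TIMEOUT, 25 );
		}
		return self::$handle;
	}

	public static function post( $url, $postFields ) {
		global $wgHTTPProxy;

		$ch = self::getHandle();
		curl_setopt( $ch, CURLOPT_URL, $url );
		curl_setopt( $ch, CURLOPT_POST, true );
		curl_setopt( $ch, CURLOPT_POSTFIELDS, $postFields );
		if ( $wgHTTPProxy ) {
			curl_setopt( $ch, CURLOPT_PROXY, $wgHTTPProxy );
		}

		$body = curl_exec( $ch );
		$status = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
		// The handle is deliberately NOT closed here, so the next post in
		// the same job run can reuse the connection.
		return array( $status, $body );
	}
}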
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Sam Reed (reedy) changed:
  Blocks: removed 38970
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Greg Grossmeier changed:
  Blocks: added 67623
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Greg Grossmeier changed:
  Priority: Unprioritized -> Normal
  CC: added g...@wikimedia.org
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Sam Reed (reedy) changed:
  Keywords: removed performance
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Aaron Schulz changed:
  Assignee: wikidata-bugs@lists.wikimedia.org -> aschulz4...@gmail.com
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Lydia Pintscher changed:
  Blocks: added 38970
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Sebastian Brückner changed:
  Keywords: added performance
  CC: added daniel.kinz...@wikimedia.de, lydia.pintscher@wikimedia.de
[Bug 67117] Performance review of PubSubHubbub extension
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Sebastian Brückner changed:
  See Also: added https://bugzilla.wikimedia.org/show_bug.cgi?id=67118