Hi,

On Fri, Apr 30, 2010 at 6:28 PM, Bob Wyman <[email protected]> wrote:
> Dave Cridland <[email protected]> wrote:
> (Yes, we can reduce the clients to simply be display devices for state
> maintained on intermediary servers, but this is not, I think, ideal.)
I disagree. Caching works, as proved by HTTP. The problems with caching in
the HTTP world are mostly caused by the lack of a good cache-update
mechanism to make sure your caches stay up-to-date.

Taking the fat Twitter notifications (with personal data, personal prefs
and all that in the same notification) as an example: if you model each of
those data units as a separate PEP node, and allow "last-mile" clients to
use the pubsub caching server on their own domain, that cache server will
be notified of each change to the information and will be able to update
its own clients.

> A reliance on clients' maintaining state would also seem to assume that a
> reasonably high percentage of the traffic shares message-independent
> "static" information with messages received earlier and thus that
> cache-hit rates are reasonably high.

And that's true for both the fat Twitter message and the example Dave
presented. A lot of the Atom metadata is not about the notification itself
but about the source of the notification. Split that into a separate node.

> Client maintenance of state is most useful when all messages have the
> same originator. It is least useful when every message has a unique
> sender.

I subscribe to about 300 Atom/RSS feeds. That translates to 600 to 700
"notifications" per day. Right now, I'm using a classical pull system, so
the source metadata is shared among 10 to 30 notifications, but most of
those are waste because I already have them. If I switch to a push system,
I would reduce the waste because I would only get each notification once.
On the other hand, each of those notifications would send me all the
source metadata that I don't need, every time. So even with a small number
of sources, extracting the source metadata would be useful.

> However, in the future, I'm fairly confident that we'll see an increase
> in the number of systems that support "content-based" publish and
> subscribe.
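To put rough numbers on my feed example (all byte sizes and the node-reference idea below are assumptions for illustration, not measurements):

```python
# Back-of-the-envelope comparison of "fat" notifications (source metadata
# embedded in every item) versus split notifications (metadata published
# once per source on a separate PEP node and cached locally).
# All byte sizes are illustrative assumptions.

NOTIFICATIONS_PER_DAY = 650   # my ~300 feeds, 600-700 items/day
SOURCES = 300                 # distinct feeds
PAYLOAD_BYTES = 500           # the actual entry content (assumed)
SOURCE_META_BYTES = 1500      # Atom feed-level metadata (assumed)
NODE_REF_BYTES = 100          # pointer to the source-metadata node (assumed)

# Fat model: every notification carries the full source metadata.
fat = NOTIFICATIONS_PER_DAY * (PAYLOAD_BYTES + SOURCE_META_BYTES)

# Split model: metadata transferred once per source, then served from the
# local cache; notifications carry only the payload plus a node reference.
split = (NOTIFICATIONS_PER_DAY * (PAYLOAD_BYTES + NODE_REF_BYTES)
         + SOURCES * SOURCE_META_BYTES)

print(f"fat:   {fat} bytes/day")
print(f"split: {split} bytes/day")
print(f"saved: {100 * (1 - split / fat):.0f}%")
```

Even with these modest assumptions the split model transfers roughly a third less per day, and the gap grows as the metadata-to-payload ratio grows.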
> Thus, we'll see messages being delivered because of their content, not
> simply because of their author. This sort of thing will be very much like
> the "Track" function that originally influenced, in part, Twitter's
> adoption of "Atom over XMPP". In the "Track" use case (when you might
> subscribe to all messages containing the keyword "XMPP") you'll often get
> messages from senders that you've never seen before or will never see
> again. Thus, you'll often find that cache hit rates are lower than you'd
> like even though you may dedicate a great deal of resource to maintaining
> that cache.

Let's model the problem a bit: a large number of users (several million,
potentially) receive a small set of notifications (per user, let's say
2000 per day) from a large set of sources (the same as the number of users
in a balanced publisher/subscriber social network, although I think the
lurkers >> the publishers). The question here is: do we send the source
metadata with each notification?

Looking at the HTTP world, I can see a very similar pattern with
JavaScript frameworks like jQuery and Prototype. You have a large set of
users, each one browsing a small set of pages. Those pages share their use
of JavaScript frameworks. Using CDNs like the Google AJAX Libraries API (I
think that's the correct name) or the Yahoo! equivalent, all those sites
share the same URL for the JS framework, to make sure that all caches can
reuse the same object for all sites, giving better performance for
everybody.

If you move the source metadata to a separate node, and don't include it
in each notification, local cache systems can provide better performance
even in those situations of content-based delivery, because across your
local server several clients will request the same source metadata with
the same key.
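The "same key" point is the whole trick, and it can be sketched in a few lines. This is a toy model, not real server code; the node id naming scheme is hypothetical:

```python
# Sketch of a "last-mile" metadata cache on a local pubsub server.
# The idea mirrors the CDN/jQuery case: as long as every notification
# refers to its source metadata by the same stable address (here a
# hypothetical pubsub node id), all clients behind the same server share
# one cached copy, even for content-based ("Track") subscriptions where
# any given client rarely sees the same sender twice.

class MetadataCache:
    def __init__(self, fetch):
        self._fetch = fetch      # callable: node id -> metadata
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, node_id):
        if node_id not in self._store:
            self.misses += 1
            self._store[node_id] = self._fetch(node_id)
        else:
            self.hits += 1
        return self._store[node_id]

    def invalidate(self, node_id):
        # Called when the server is notified that the node changed --
        # the cache-update mechanism that HTTP caching mostly lacks.
        self._store.pop(node_id, None)

# Three clients on the same server track the keyword "XMPP" and all
# receive items from the same (previously unseen) source: one remote
# fetch, two local hits.
cache = MetadataCache(fetch=lambda node: {"node": node, "title": "..."})
for _client in range(3):
    cache.get("pubsub.example.org/source-meta/blog.example.com")
print(cache.misses, cache.hits)  # 1 miss, 2 hits
```

Per-client hit rates can be terrible under content-based routing, but the cache is per-server, so what matters is how often the *server's* population repeats a source, which is much more favourable.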
> So, we see that, at least, limitations in the XMPP protocol, resource
> limitations on the clients, and a move towards cache-inefficient
> content-based routing all tend to argue against an assumption that we can
> rely on clients to maintain state...

Client state is only limited by the local storage of the device, not by
the current or possible future nature of the notifications (right now,
white-listed blogroll/follower lists; in the future, content-based
tracking like Collecta). I argue that caching works in both situations if
the information inside the notification is properly arranged so that
common units have the same source address across the multiple
notifications.

Bye,
--
Pedro Melo
http://www.simplicidade.org/
xmpp:[email protected]
mailto:[email protected]
