John,
I am not polling. I am simply trying to implement a basic "refresh" feature like every desktop/mobile Twitter app has. Basically, I just want to let users scroll through their timelines and be reasonably sure that I am presenting them with an accurate and complete view of the timeline, while using as little bandwidth as possible.

When I said "10 seconds old"/"30 seconds old"/etc., I was referring to the age at the time the page of tweets was generated. So, basically, if the tweet's timestamp and the response's Last-Modified time differ by more than 10,000 ms (from what you said below), you are almost definitely getting At Least Once behavior if Twitter is operating normally, and you can use that information to get At Least Once behavior that emulates Exactly Once behavior with little (usually no) overhead.

Is that a correct interpretation of what you were saying?

Thanks,
Brian

From: twitter-development-talk@googlegroups.com [mailto:twitter-development-t...@googlegroups.com] On Behalf Of John Kalucki
Sent: Friday, April 09, 2010 3:31 PM
To: twitter-development-talk@googlegroups.com
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

Your second paragraph doesn't quite make sense. The period between your next poll and the timestamp of the last status is irrelevant. The issue is solely the magnitude of K on the roughly sorted stream of events that are applied to the materialized timeline vector. As K varies, so do the odds, however infinitesimally small, that you will miss a tweet using the last status id returned. The period between your polls of the API does not affect this K.

My recommendation is to ignore this issue in nearly every use case. If you are, however, polling high velocity timelines (including search queries) and attempting to approximate an Exactly Once QoS, you should, basically, stop doing that. You are probably wasting resources and you'll probably never get Exactly Once behavior anyway.
Use the Streaming API instead.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith <br...@briansmith.org> wrote:

John,

Thank you. That was one of the most informative emails on the Twitter API I have seen on the list.

Basically, even now, an application should not use an ID of a tweet for since_id if the tweet is less than 10 seconds old, ignoring service abnormalities. Probably a larger threshold (30 seconds or even a minute) would be better, especially when you take into consideration the likelihood of clock skew between the servers that generate the timestamps.

I think this is information that would be useful to have added to the API documentation, as I know many applications are taking a much more naive approach to pagination.

Thanks again,
Brian

From: twitter-development-talk@googlegroups.com On Behalf Of John Kalucki
Sent: Friday, April 09, 2010 1:20 PM
To: twitter-development-talk@googlegroups.com
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

Folks are making a lot of incorrect assumptions about the Twitter architecture, especially around how we materialize and present timeline vectors and just what QoS we're really offering. This new scheme does not significantly, or perhaps even observably, make the existing issues around since_id any better or any worse. And I'm being very precise here.

The since_id situation is such that the few milliseconds skew possible in Snowflake are practically irrelevant and lost in the noise of a 4 to 6 orders-of-magnitude misconception. (That's a very big misconception.) If you do not know the rough ordering of our event stream as it is applied to the materialized timeline vectors and also the expected rate of change on the timeline in question, you cannot make good choices about making since_id perfect. But neither should you try to make it perfect, nor should you have to worry about this.
If you insist upon worrying about this, here's my slight salting of Mark's advice:

In the existing continuously increasing id generation scheme on the Twitter.com API, I'd subtract about 5000 ids from since_id to ensure sufficient overlap in nearly all cases, but even this could be lossy in the face of severe operational issues -- issues of a type that we haven't seen in many, many months. The search API has a different K in its rough ordering, so you might need more like 10,000 ids. In the new Snowflake scheme, I'd overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for search APIs.

Despite all this, things still could go wrong. An engineer here is known for pointing out that even things that almost never ever happen, happen all the time on the Twitter system. Now, just because they are happening, to someone, all the time, doesn't mean that they'll ever ever happen to you or your users in a thousand years -- but someone's getting hit with it, somewhere, a few times a day.

The above schemes no longer treat the id as an opaque unique ordered identifier. And woe lies in wait for you as changes are made to these ids. Woe. You also need to deduplicate. Be very careful and understand fully what you summon by breaking this semantic contract.

In the end, since_id issues go away on the Streaming API, and other than around various start-up discontinuities, you can ignore this issue.

I'll be talking about Rough Ordering, among other things Streaming, at the Chirp conference. Come geek out.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman <d...@fishtwits.com> wrote:

On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
> However, I wanted to be clear and feel it should be made obvious that
> with this change, there is a possibility that a tweet may not be
> delivered to client if the implementation of how since_id is currently
> used is not updated to cover the case.
> I still envision the situation
> as more likely than you seem to believe and figure as tweet velocity
> increases, the likelihood will also increase; but I am assuming you have
> better data to support your viewpoint than I and shall defer.

Maybe I'm just missing something here, but it seems trivial to fix on Twitter's side (enough so that I assume it's what they've been planning from the start to do): Only return tweets from closed buckets.

We are guaranteed that the buckets will be properly ordered. The order will only be randomized within a bucket. Therefore, by only returning tweets from buckets which are no longer receiving new tweets, since_id works and will never miss a tweet.

And, yes, this does mean a slight delay in getting the tweets out, because they have to wait a few milliseconds for their bucket to close before being exposed to calls which can use since_id, plus maybe a little longer for the contents of that bucket to be distributed to multiple servers. That's still only going to take time comparable to the round-trip time of an HTTP request to fetch the data for display to a user, and far, far less than the average refresh delay required by those clients which fall under the API rate limit.

I submit, therefore, that any such delay caused by waiting for buckets to close will be inconsequential.

-- 
Dave Sherohman
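
For readers who do want to try the overlap-and-deduplicate approach John describes earlier in the thread, here is a minimal sketch in Python. It is only an illustration, not anything Twitter publishes. The function names are hypothetical, the statuses are assumed to be dicts with an "id" key, and the 22-bit shift assumes that a Snowflake id carries its millisecond timestamp above 22 low-order bits. The defaults of 5000 are John's suggested figures for the twitter.com APIs; he suggests 10,000 for search.

```python
# Illustrative sketch only. All names here are hypothetical, and the bit
# layout is an assumption about Snowflake ids, not a documented contract.

SNOWFLAKE_TIMESTAMP_SHIFT = 22  # assumed: timestamp (ms) sits above 22 low bits


def since_id_with_id_overlap(last_seen_id, overlap_ids=5000):
    """Back off since_id by a fixed number of ids (sequential id scheme)."""
    return max(0, last_seen_id - overlap_ids)


def since_id_with_ms_overlap(last_seen_id, overlap_ms=5000):
    """Back off since_id by a number of milliseconds (Snowflake scheme)."""
    return max(0, last_seen_id - (overlap_ms << SNOWFLAKE_TIMESTAMP_SHIFT))


def drop_duplicates(page, seen_ids):
    """Return statuses from `page` not seen before, recording their ids.

    The overlap means every refresh re-fetches some statuses, so the
    client must filter them out itself. `seen_ids` is a set that is
    updated in place.
    """
    fresh = []
    for status in page:
        if status["id"] not in seen_ids:
            seen_ids.add(status["id"])
            fresh.append(status)
    return fresh
```

A client would pass the backed-off value as since_id on its next refresh and run the response through drop_duplicates before displaying it. As John warns, this treats the id as something other than an opaque identifier, so it can break whenever the id format changes.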