[twitter-dev] Re: Streaming API: Spritzer-stream coverage
On Tue, May 26, 2009 at 10:07 PM, elversatile elversat...@gmail.com wrote:
> Makes sense. I was assuming the same. Thanks people! John from Twitter said that spritzer is 1/3 of the gardenhose, which makes it 15%. So I guess statistical insignificance of spritzer is due to its low percentage.

I'm also curious what statistical insignificance means in this context, since in the Streaming API docs they're pretty assiduous saying which are significant vs. insignificant. Sample sizes far lower than 4% are of course fine for certain purposes as long as they're drawn uniformly. And even if not all that uniform, they might still be good enough :)

There are so many different things to do with *hose/spritzer I'm not sure what statistical significance means in the abstract. I'm seeing hundreds of thousands of messages per day on /spritzer. If you're interested in computing a statistic that holds across all tweets -- say, average tweet length -- that's *plenty*. (Now, if you wanted to compute the statistic per 1 minute time window and cared about minute-per-minute differences, the story might be different...)

I'm curious to know what the docs author meant by statistically (in)significant here.

Brendan
[ http://anyall.org ]
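
The sample-size point above can be made concrete with the standard error of a sample mean, s / sqrt(n). The sketch below is only an illustration: it assumes the tweets are drawn uniformly, and the figures (200,000 tweets per day, a standard deviation of 40 characters) are hypothetical placeholders, not measurements from the stream.

```python
import math


def standard_error_of_mean(sample_stdev: float, sample_size: int) -> float:
    """Standard error of a sample mean for a simple random sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return sample_stdev / math.sqrt(sample_size)


# Hypothetical inputs, not measured values: 200,000 tweets seen in a day
# and a standard deviation of 40 characters in tweet length.
se_day = standard_error_of_mean(40.0, 200_000)

# The same hypothetical daily volume split evenly into 1,440 one-minute windows.
se_minute = standard_error_of_mean(40.0, 200_000 // 1440)

print(f"per day: +/- {se_day:.3f} chars, per minute: +/- {se_minute:.2f} chars")
```

Under these assumed numbers the daily estimate is pinned down far more tightly than any single one-minute window, which is the distinction drawn in the message.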
[twitter-dev] Re: Streaming API: Spritzer-stream coverage
Folks,

The significant/insignificant language currently isn't that important or clear, as we're preparing for future changes. The spritzer will likely remain a small public sample, the gardenhose will likely remain a larger sample that requires an EULA. The proportions, however, are subject to continuous change -- we want to provide a useful flow, but, at the same time, we don't want to incur excessive cost or overwhelm clients. Given our traffic growth, we will probably have to trim rates down -- few clients want a 5 mbit/sec spritzer feed.

We haven't, yet, worked out a model for adjusting the sampling proportions. The sampling may be based on some public model of statistical significance, it may be driven by practical matters, by client requirements, some unknown factor, or some combination of them all. We're still measuring, analyzing, and reasoning about the Streaming API, and there's plenty we don't know just yet.

-John Kalucki
Services, Twitter Inc.

On May 26, 11:55 pm, Brendan O'Connor breno...@gmail.com wrote:
> On Tue, May 26, 2009 at 10:07 PM, elversatile elversat...@gmail.com wrote:
>> Makes sense. I was assuming the same. Thanks people! John from Twitter said that spritzer is 1/3 of the gardenhose, which makes it 15%. So I guess statistical insignificance of spritzer is due to its low percentage.
>
> I'm also curious what statistical insignificance means in this context, since in the Streaming API docs they're pretty assiduous saying which are significant vs. insignificant. Sample sizes far lower than 4% are of course fine for certain purposes as long as they're drawn uniformly. And even if not all that uniform, they might still be good enough :)
>
> There are so many different things to do with *hose/spritzer I'm not sure what statistical significance means in the abstract. I'm seeing hundreds of thousands of messages per day on /spritzer. If you're interested in computing a statistic that holds across all tweets -- say, average tweet length -- that's *plenty*. (Now, if you wanted to compute the statistic per 1 minute time window and cared about minute-per-minute differences, the story might be different...)
>
> I'm curious to know what the docs author meant by statistically (in)significant here.
>
> Brendan
> [ http://anyall.org ]
[twitter-dev] Re: Streaming API: Spritzer-stream coverage
Hi Sven,

Well, I merely assumed that the easiest way for Twitter to send a subset of tweets on spritzer was to send them based on their ids (autoincrement integer)... Watching the stream, I noticed that all the ids were ending with 000, 001, 002, 003, 004, 100, 102, ... 900, 901, ... 904. I did not push the analysis further though.

On May 26, 3:24 am, Sven Svensson twitterf...@gmail.com wrote:
> Hi Stephane,
>
> I used the following calculation to obtain a four percent estimate for the spritzer stream:
>
> tweets_seen_in_stream / (max_tweet_id_seen_in_stream - min_tweet_id_seen_in_stream)
>
> Did you use the same methodology? The four percent is probably a bit too low, as I assume private tweets get tweet_ids too, which makes the denominator a bit too large.
>
> On Mon, May 25, 2009 at 11:39 PM, stephane stephane.philipa...@gmail.com wrote:
>> Looking at the tweet ids, it looks like the spritzer stream delivers 5 tweets in every hundred. This would make it 5% of the firehose. Am I correct?
>>
>> Stephane
>> http://www.twazzup.com
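
Sven's estimate quoted above (tweets seen divided by the span of tweet ids observed) could be sketched as follows. The function name and the synthetic ids are assumptions made for illustration, and the estimate inherits the caveat raised in the thread: ids assigned to private tweets inflate the denominator.

```python
from typing import Iterable


def estimate_coverage(tweet_ids: Iterable[int]) -> float:
    """Tweets seen divided by the span of tweet ids observed."""
    ids = list(tweet_ids)
    if len(ids) < 2:
        raise ValueError("need at least two tweet ids")
    id_span = max(ids) - min(ids)
    if id_span == 0:
        raise ValueError("all tweet ids are identical")
    return len(ids) / id_span


# Synthetic ids for illustration only (not real stream data):
# keep every id whose last two digits fall in 00-04.
synthetic_ids = [i for i in range(1_000_000, 1_100_000) if i % 100 < 5]
coverage = estimate_coverage(synthetic_ids)
print(f"estimated coverage: {coverage:.2%}")
```

A longer observation window makes the edge effects at the minimum and maximum id matter less.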
[twitter-dev] Re: Streaming API: Spritzer-stream coverage
Makes sense. I was assuming the same. Thanks people! John from Twitter said that spritzer is 1/3 of the gardenhose, which makes it 15%. So I guess statistical insignificance of spritzer is due to its low percentage. Any explanation directly from Twitter?

On May 26, 6:01 pm, stephane stephane.philipa...@gmail.com wrote:
> Hi Sven,
>
> Well, I merely assumed that the easiest way for Twitter to send a subset of tweets on spritzer was to send them based on their ids (autoincrement integer)... Watching the stream, I noticed that all the ids were ending with 000, 001, 002, 003, 004, 100, 102, ... 900, 901, ... 904. I did not push the analysis further though.
>
> On May 26, 3:24 am, Sven Svensson twitterf...@gmail.com wrote:
>> Hi Stephane,
>>
>> I used the following calculation to obtain a four percent estimate for the spritzer stream:
>>
>> tweets_seen_in_stream / (max_tweet_id_seen_in_stream - min_tweet_id_seen_in_stream)
>>
>> Did you use the same methodology? The four percent is probably a bit too low, as I assume private tweets get tweet_ids too, which makes the denominator a bit too large.
>>
>> On Mon, May 25, 2009 at 11:39 PM, stephane stephane.philipa...@gmail.com wrote:
>>> Looking at the tweet ids, it looks like the spritzer stream delivers 5 tweets in every hundred. This would make it 5% of the firehose. Am I correct?
>>>
>>> Stephane
>>> http://www.twazzup.com
[twitter-dev] Re: Streaming API: Spritzer-stream coverage
How are spritzer statuses sampled? Are they picked uniformly at random? Or is there some logic behind it? Also, what makes it statistically insignificant? Is it its percentage in relation to the entire stream or the way it is sampled?

Thanks,
-Eldar

On May 24, 8:23 pm, John Kalucki jkalu...@gmail.com wrote:
> Sven,
>
> Excluding connection ramp-up and ramp-down skew, each spritzer feed delivers the same statuses as all other spritzer feeds. Likewise, each gardenhose feed delivers the same statuses as all other gardenhose feeds. Also, spritzer feeds are a strict subset of gardenhose feeds.
>
> There's no point in consuming multiple sampled feeds (spritzer/spritzer, gardenhose/spritzer, gardenhose/gardenhose), as you'll just receive duplicate data. Multiple sessions on sampled feeds just waste scarce resources and you also may find your access automatically limited for a period of time.
>
> Reduce, reuse, recycle!
>
> -John Kalucki
> Services, Twitter Inc.
>
> On May 24, 10:51 am, Sven Svensson twitterf...@gmail.com wrote:
>> Thanks for an excellent API. I have two questions in relation to the streaming API:
>>
>> * Assume that two users are both reading the spritzer stream at the same time - will they get the same spritzer streams covering the same subset of all tweets, or will they get two separate spritzer streams covering different tweets?
>> * Roughly what percentage of all tweets are distributed in the spritzer stream? Is it in the region of four percent of all tweets (my guesstimate)?
>>
>> Thanks!
[twitter-dev] Re: Streaming API: Spritzer-stream coverage
Looking at the tweet ids, it looks like the spritzer stream delivers 5 tweets in every hundred. This would make it 5% of the firehose. Am I correct?

Stephane
http://www.twazzup.com

On May 25, 12:17 am, elversatile elversat...@gmail.com wrote:
> How are spritzer statuses sampled? Are they picked uniformly at random? Or is there some logic behind it? Also, what makes it statistically insignificant? Is it its percentage in relation to the entire stream or the way it is sampled?
>
> Thanks,
> -Eldar
>
> On May 24, 8:23 pm, John Kalucki jkalu...@gmail.com wrote:
>> Sven,
>>
>> Excluding connection ramp-up and ramp-down skew, each spritzer feed delivers the same statuses as all other spritzer feeds. Likewise, each gardenhose feed delivers the same statuses as all other gardenhose feeds. Also, spritzer feeds are a strict subset of gardenhose feeds.
>>
>> There's no point in consuming multiple sampled feeds (spritzer/spritzer, gardenhose/spritzer, gardenhose/gardenhose), as you'll just receive duplicate data. Multiple sessions on sampled feeds just waste scarce resources and you also may find your access automatically limited for a period of time.
>>
>> Reduce, reuse, recycle!
>>
>> -John Kalucki
>> Services, Twitter Inc.
>>
>> On May 24, 10:51 am, Sven Svensson twitterf...@gmail.com wrote:
>>> Thanks for an excellent API. I have two questions in relation to the streaming API:
>>>
>>> * Assume that two users are both reading the spritzer stream at the same time - will they get the same spritzer streams covering the same subset of all tweets, or will they get two separate spritzer streams covering different tweets?
>>> * Roughly what percentage of all tweets are distributed in the spritzer stream? Is it in the region of four percent of all tweets (my guesstimate)?
>>>
>>> Thanks!
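
One way to check the "5 tweets in every hundred" observation is to tally the last two digits of each id seen and count how many distinct values occur. This is a minimal sketch under the thread's own guess that sampling is keyed on the id; the function names and the synthetic ids are hypothetical, and nothing here is confirmed as Twitter's actual method.

```python
from collections import Counter
from typing import Iterable


def suffix_profile(tweet_ids: Iterable[int], modulus: int = 100) -> Counter:
    """Count how often each residue (id mod modulus) appears."""
    return Counter(tweet_id % modulus for tweet_id in tweet_ids)


def implied_fraction(tweet_ids: Iterable[int], modulus: int = 100) -> float:
    """Share of possible residues that actually occur in the sample."""
    return len(suffix_profile(tweet_ids, modulus)) / modulus


# Synthetic ids for illustration only: ids whose last two digits are 00-04.
sample_ids = [i for i in range(5_000_000, 5_010_000) if i % 100 < 5]
profile = suffix_profile(sample_ids)
print(sorted(profile))               # residues that occur in the sample
print(implied_fraction(sample_ids))  # fraction of residues present
```

If the residues seen on the real stream were spread over all 100 values instead of a handful, the id-based explanation would not hold.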