[twitter-dev] Re: Streaming API: Spritzer-stream coverage

2009-05-27 Thread Brendan O'Connor

On Tue, May 26, 2009 at 10:07 PM, elversatile elversat...@gmail.com wrote:

 Makes sense. I was assuming the same. Thanks people! John from Twitter
 said that spritzer is 1/3 of the gardenhose, which makes it 15%. So I
 guess statistical insignificance of spritzer is due to its low
 percentage.

I'm also curious what statistical insignificance means in this
context, since the Streaming API docs are pretty assiduous about
saying which streams are significant vs. insignificant.  Sample sizes far
lower than 4% are of course fine for certain purposes as long as
they're drawn uniformly.  And even if not all that uniform, they might
still be good enough :)

There are so many different things to do with *hose/spritzer I'm not
sure what statistical significance means in the abstract.  I'm seeing
hundreds of thousands of messages per day on /spritzer.  If you're
interested in computing a statistic that holds across all tweets --
say, average tweet length -- that's *plenty*.  (Now, if you wanted to
compute the statistic per 1 minute time window and cared about
minute-per-minute differences, the story might be different...)
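
To put rough numbers on the point above, here is a minimal sketch of the standard error of a sample mean. It assumes the samples are independent and drawn uniformly. The standard deviation of 40 characters and the volume of 200,000 tweets per day are hypothetical placeholders, not measurements from the stream.

```python
import math

def standard_error_of_mean(sample_std, n):
    """Standard error of a sample mean for n independent, uniformly drawn samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sample_std / math.sqrt(n)

# Hypothetical numbers for illustration only: a standard deviation of
# 40 characters in tweet length and 200,000 sampled tweets per day.
ASSUMED_STD = 40.0
ASSUMED_DAILY_TWEETS = 200_000

daily = standard_error_of_mean(ASSUMED_STD, ASSUMED_DAILY_TWEETS)

# The same daily volume split evenly over 1,440 one-minute windows.
per_minute = standard_error_of_mean(ASSUMED_STD, ASSUMED_DAILY_TWEETS // 1440)

print(f"daily: +/-{daily:.3f} chars, per minute: +/-{per_minute:.3f} chars")
```

Under these assumptions the per-minute uncertainty is dozens of times larger than the daily one, which is why the minute-by-minute case may need a bigger sample.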

I'm curious to know what the docs author meant by statistically
(in)significant here.

Brendan
[ http://anyall.org ]


[twitter-dev] Re: Streaming API: Spritzer-stream coverage

2009-05-27 Thread John Kalucki

Folks,

The significant/insignificant language currently isn't that important
or clear, as we're preparing for future changes. The spritzer will
likely remain a small public sample, the gardenhose will likely remain
a larger sample that requires an EULA. The proportions, however, are
subject to continuous change -- we want to provide a useful flow, but,
at the same time, we don't want to incur excessive cost or overwhelm
clients. Given our traffic growth, we will probably have to trim rates
down -- few clients want a 5 Mbit/sec spritzer feed.

We haven't, yet, worked out a model for adjusting the sampling
proportions. The sampling may be based on some public model of
statistical significance, it may be driven by practical matters, by
client requirements, some unknown factor, or some combination of them
all. We're still measuring, analyzing, and reasoning about the
Streaming API, and there's plenty we don't know just yet.

-John Kalucki
Services, Twitter Inc.






[twitter-dev] Re: Streaming API: Spritzer-stream coverage

2009-05-26 Thread stephane

Hi Sven,

well, I merely assumed that the easiest way for Twitter to send a
subset of tweets on spritzer was to select them based on their ids
(autoincrement integer)...
watching the stream, I noticed that all the ids were ending with
000, 001, 002, 003, 004, 100, 102, ... 900, 901, ... 904

I did not push the analysis further though
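
One way to push the analysis a little further is to tally the last two digits of every id seen, as in the rough sketch below. It assumes the sample is selected by id modulo 100, which is only a guess from watching the stream and not something Twitter has confirmed. The function names and the sample ids are made up for illustration.

```python
from collections import Counter

def last_two_digit_counts(tweet_ids):
    """Count how often each two-digit ending (id modulo 100) appears."""
    return Counter(tweet_id % 100 for tweet_id in tweet_ids)

def implied_fraction(tweet_ids):
    """Fraction of the 100 possible two-digit endings that appear at all."""
    return len(last_two_digit_counts(tweet_ids)) / 100

# Made-up ids for illustration, not real stream data.
sample_ids = [1923000, 1923001, 1923102, 1923204, 1923903, 1924000]

counts = last_two_digit_counts(sample_ids)
print(sorted(counts))                # which endings were observed
print(implied_fraction(sample_ids))  # share of endings that were observed
```

If only the endings 00 through 04 ever show up over a long capture, that would be consistent with a 5% sample selected by id.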

On May 26, 3:24 am, Sven Svensson twitterf...@gmail.com wrote:
 Hi Stephane,

 I used the following calculation to obtain a four percent estimate for
 the spritzer stream:
   tweets_seen_in_stream / (max_tweet_id_seen_in_stream - min_tweet_id_seen_in_stream)

 Did you use the same methodology?

 The four percent is probably a bit too low, as I assume private tweets
 get tweet ids too, which makes the denominator a bit too large.
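
The calculation above can be sketched as follows. This assumes ids are handed out sequentially across all tweets, which is the assumption made in the message and not a documented guarantee. The function name and the example ids are hypothetical.

```python
def estimate_coverage(tweet_ids):
    """Estimate stream coverage as tweets seen / (max id seen - min id seen)."""
    ids = list(tweet_ids)
    if len(ids) < 2:
        raise ValueError("need at least two tweet ids")
    id_span = max(ids) - min(ids)
    if id_span == 0:
        raise ValueError("all tweet ids are identical")
    return len(ids) / id_span

# Made-up ids for illustration: every 25th id between 1000 and 2000.
example_ids = range(1000, 2001, 25)
coverage = estimate_coverage(example_ids)
print(f"estimated coverage: {coverage:.1%}")
```

Because ids used by private tweets count toward the span but never appear in the stream, the result is best read as a lower bound on the share of public tweets.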

 On Mon, May 25, 2009 at 11:39 PM, stephane stephane.philipa...@gmail.com wrote:
  looking at the tweet ids, it looks like the spritzer stream delivers 5 tweets
  in every hundred
  this would make it 5% of the firehose

  am i correct?

  Stephane
 http://www.twazzup.com


[twitter-dev] Re: Streaming API: Spritzer-stream coverage

2009-05-26 Thread elversatile

Makes sense. I was assuming the same. Thanks people! John from Twitter
said that spritzer is 1/3 of the gardenhose, which makes it 15%. So I
guess statistical insignificance of spritzer is due to its low
percentage. Any explanation directly from Twitter?



[twitter-dev] Re: Streaming API: Spritzer-stream coverage

2009-05-25 Thread elversatile

How are spritzer statuses sampled? Are they picked uniformly at
random? Or is there some logic behind it?

Also, what makes it statistically insignificant? Is it its
percentage in relation to the entire stream or the way it is sampled?

Thanks,
-Eldar

On May 24, 8:23 pm, John Kalucki jkalu...@gmail.com wrote:
 Sven,

 Excluding connection ramp-up and ramp-down skew, each spritzer feed
 delivers the same statuses as all other spritzer feeds. Likewise, each
 gardenhose feed delivers the same statuses as all other gardenhose
 feeds. Also, spritzer feeds are a strict subset of gardenhose feeds.
 There's no point in consuming multiple sampled feeds (spritzer/
 spritzer, gardenhose/spritzer, gardenhose/gardenhose), as you'll just
 receive duplicate data.

 Multiple sessions on sampled feeds just waste scarce resources and you
 also may find your access automatically limited for a period of time.
 Reduce, reuse, recycle!

 -John Kalucki
 Services, Twitter Inc.

 On May 24, 10:51 am, Sven Svensson twitterf...@gmail.com wrote:

  Thanks for an excellent API.

  I have two questions in relation to the streaming API:

  * Assume that two users are both reading the spritzer stream at the same
  time - will they get the same spritzer streams covering the same subset of
  all tweets, or will they get two separate spritzer streams covering
  different tweets?

  * Roughly what percentage of all tweets are distributed in the spritzer
  stream? Is it in the region of four percent of all tweets (my guesstimate)?

  Thanks!


[twitter-dev] Re: Streaming API: Spritzer-stream coverage

2009-05-25 Thread stephane

looking at the tweet ids, it looks like the spritzer stream delivers 5
tweets in every hundred
this would make it 5% of the firehose

am i correct?

Stephane
http://www.twazzup.com
