Folks,

The significant/insignificant language currently isn't that important
or clear, as we're preparing for future changes. The spritzer will
likely remain a small public sample, and the gardenhose will likely
remain a larger sample that requires an EULA. The proportions, however,
are subject to continuous change -- we want to provide a useful flow
but, at the same time, we don't want to incur excessive cost or
overwhelm clients. Given our traffic growth, we will probably have to
trim rates down -- few clients want a 5 Mbit/sec spritzer feed.

We haven't yet worked out a model for adjusting the sampling
proportions. The sampling may be based on some public model of
statistical significance, or it may be driven by practical matters,
client requirements, some unknown factor, or some combination of them
all. We're still measuring, analyzing, and reasoning about the
Streaming API, and there's plenty we don't know just yet.
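
As a rough illustration of the sample-size point raised in the quoted
message below (whether a small uniform sample is enough to estimate
something like average tweet length), here is a minimal sketch. It does
not reflect how the Streaming API actually samples: the tweet lengths
are synthetic, the rates are arbitrary, and the helper name
sample_mean_with_stderr is made up for this example.

```python
import math
import random
import statistics


def sample_mean_with_stderr(values, rate, seed=0):
    """Keep each value independently with probability `rate` and return
    (sample_size, sample_mean, estimated_standard_error_of_the_mean)."""
    rng = random.Random(seed)
    sample = [v for v in values if rng.random() < rate]
    n = len(sample)
    if n < 2:
        raise ValueError("sample too small to estimate a standard error")
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    stderr = statistics.stdev(sample) / math.sqrt(n)
    return n, mean, stderr


# Hypothetical population: 200,000 synthetic "tweet lengths" (1..140 chars).
population_rng = random.Random(42)
population = [population_rng.randint(1, 140) for _ in range(200_000)]
true_mean = statistics.fmean(population)

# Assumed sampling rates, chosen only for illustration.
results = {}
for rate in (0.15, 0.05, 0.01):
    results[rate] = sample_mean_with_stderr(population, rate)
    n, mean, stderr = results[rate]
    print(f"rate={rate:.2f} n={n} mean={mean:.2f} "
          f"stderr={stderr:.2f} true_mean={true_mean:.2f}")
```

The standard error shrinks roughly with the square root of the sample
size, so for a statistic computed across all tweets even a small
uniform sample can be adequate; per-minute statistics have far fewer
tweets per window and would be noisier.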

-John Kalucki
Services, Twitter Inc.




On May 26, 11:55 pm, "Brendan O'Connor" <breno...@gmail.com> wrote:
> On Tue, May 26, 2009 at 10:07 PM, elversatile <elversat...@gmail.com> wrote:
>
> > Makes sense. I was assuming the same. Thanks people! John from Twitter
> > said that spritzer is 1/3 of the gardenhose, which makes it 15%. So I
> > guess statistical insignificance of spritzer is due to its low
> > percentage.
>
> I'm also curious what "statistical insignificance" means in this
> context, since in the Streaming API docs they're pretty assiduous
> saying which are "significant" vs. "insignificant".  Sample sizes far
> lower than 4% are of course fine for certain purposes as long as
> they're drawn uniformly.  And even if not all that uniform, they might
> still be good enough :)
>
> There are so many different things to do with *hose/spritzer I'm not
> sure what statistical significance means in the abstract.  I'm seeing
> hundreds of thousands of messages per day on /spritzer.  If you're
> interested in computing a statistic that holds across all tweets --
> say, average tweet length -- that's *plenty*.  (Now, if you wanted to
> compute the statistic per 1 minute time window and cared about
> minute-per-minute differences, the story might be different...)
>
> I'm curious to know what the docs author meant by "statistically
> (in)significant" here.
>
> Brendan
> [http://anyall.org]
