> To clarify, does this mean that each (non-protected) user has an equal
> probability of showing up in the stream regardless of how often they
> tweet?

Nope. The stream is a sample of statuses as they are posted. Each
status has an equal probability of being selected. This isn't a user
sampling mechanism. Rather, it gives you a random sampling of users,
weighted by their posting rate.
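
To make that concrete, here is a minimal simulation sketch in Python. It does not touch the Streaming API. The user names, posting rates, and the 5% sample fraction are all made up for illustration, and `sample_statuses` is a hypothetical helper. It keeps every status with the same probability and then counts how often each author turns up.

```python
import random
from collections import Counter

def sample_statuses(posting_rates, n_statuses, sample_fraction, seed=0):
    """Simulate a uniform sample of statuses; count appearances per author.

    posting_rates: hypothetical mapping of user -> relative posting rate.
    sample_fraction: assumed probability that any single status is kept.
    """
    rng = random.Random(seed)
    users = list(posting_rates)
    weights = [posting_rates[u] for u in users]
    # Each status is authored by a user in proportion to that user's rate.
    authors = rng.choices(users, weights=weights, k=n_statuses)
    # Every status is kept with the same probability, whoever wrote it.
    kept = [a for a in authors if rng.random() < sample_fraction]
    return Counter(kept)

# Illustrative rates only: "heavy" posts 50x as often as "light".
rates = {"heavy": 50, "medium": 5, "light": 1}
counts = sample_statuses(rates, n_statuses=200_000, sample_fraction=0.05)
for user, n in counts.most_common():
    print(user, n)
```

Each status has the same chance of being kept, but the heavy poster shows up in the sample roughly in proportion to its posting rate. That is the weighting I mean.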

Still, this is probably the best source of data to build a corpus for
language detection experimentation. If you get results, good or bad,
please post them here -- doubly so if you get over ~80-85% accuracy.

-John



On Oct 12, 9:24 am, Ryan Rosario <uclamath...@gmail.com> wrote:
> > That sample will be biased towards more active posters and may include
> > some demographic biases due to seasonal activities during the limited
> > time frame of the sample.
>
> That answers my question, and that is what I was afraid of. I think
> for my purposes (language detection), a random sample of active users
> is fine. I just wanted to get opinions.
>
> > The Streaming API sample method would provide a random sampling of
> > public users weighted by update rate, not a random sampling of all
> > users. The default 'spritzer' should be sufficient for most uses.
>
> To clarify, does this mean that each (non-protected) user has an equal
> probability of showing up in the stream regardless of how often they
> tweet?
>
> Thanks,
> Ryan
>
> On Oct 12, 8:31 am, Chris Babcock <cbabc...@kolonelpanic.org> wrote:
>
> > > I am doing some research using the Twitter API and I would like to get
> > > a random sample of Twitter users. Any ideas of how this can be
> > > accomplished?
>
> > Here's a start: http://en.wikipedia.org/wiki/Sampling_(statistics)
>
> > At this point you are asking for a sampling method without providing an
> > adequate definition of the population.
>
> > > So far, I have scraped 2 weeks from the Streaming API and extracted 3
> > > million user IDs from the stream. Any arguments as to whether or not
> > > this could constitute random?
>
> > That sample will be biased towards more active posters and may include
> > some demographic biases due to seasonal activities during the limited
> > time frame of the sample.
>
> > Chris Babcock
