> To clarify, does this mean that each (non-protected) user has an equal
> probability of showing up in the stream regardless of how often they
> tweet?

Nope. The stream is a sample of statuses as they are posted. Each status has an equal probability of being selected. This isn't a user sampling mechanism. Rather, it gives you a random sampling of users weighted by their posting rate, so the more often a user posts, the more likely they are to show up.

Still, this is probably the best source of data to build a corpus for language detection experimentation. If you get results, good or bad, please post them here -- doubly so if you get over ~80-85% accuracy.

-John

On Oct 12, 9:24 am, Ryan Rosario <uclamath...@gmail.com> wrote:
> > That sample will be biased towards more active posters and may include
> > some demographic biases due to seasonal activities during the limited
> > time frame of the sample.
>
> That answers my question, and that is what I was afraid of. I think
> for my purposes (language detection), a random sample of active users
> is fine. I just wanted to get opinions.
>
> > The Streaming API sample method would provide a random sampling of
> > public users weighted by update rate, not a random sampling of all
> > users. The default 'spritzer' should be sufficient for most uses.
>
> To clarify, does this mean that each (non-protected) user has an equal
> probability of showing up in the stream regardless of how often they
> tweet?
>
> Thanks,
> Ryan
>
> On Oct 12, 8:31 am, Chris Babcock <cbabc...@kolonelpanic.org> wrote:
>
> > > I am doing some research using the Twitter API and I would like to get
> > > a random sample of Twitter users. Any ideas of how this can be
> > > accomplished?
>
> > Here's a start: http://en.wikipedia.org/wiki/Sampling_(statistics)
>
> > At this point you are asking for a sampling method without providing an
> > adequate definition of the population.
>
> > > So far, I have scraped 2 weeks from the Streaming API and extracted 3
> > > million user IDs from the stream. Any arguments as to whether or not
> > > this could constitute random?
>
> > That sample will be biased towards more active posters and may include
> > some demographic biases due to seasonal activities during the limited
> > time frame of the sample.
>
> > Chris Babcock
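
For anyone who wants to see the weighting effect described in the reply at the top of this thread, here is a rough sketch in Python. It is only an illustration under stated assumptions: the user names, posting counts, and the 5% sampling fraction are all made up, and the code does not talk to the Streaming API at all. It assumes each status is kept independently with the same probability `p`, so a user with `n` statuses appears at least once with probability `1 - (1 - p)**n`.

```python
import random
from collections import Counter


def appearance_probability(n_statuses, p):
    """Chance that a user with n_statuses shows up at least once
    when every status is kept independently with probability p."""
    return 1.0 - (1.0 - p) ** n_statuses


def simulate_status_sample(posting_counts, p, seed=0):
    """Keep each status with probability p and count the kept
    statuses per user. posting_counts maps user -> number of statuses."""
    rng = random.Random(seed)
    hits = Counter()
    for user, n_statuses in posting_counts.items():
        for _ in range(n_statuses):
            if rng.random() < p:
                hits[user] += 1
    return hits


# Hypothetical users and posting counts, chosen only for illustration.
posting_counts = {"light": 10, "medium": 100, "heavy": 1000}
p = 0.05  # assumed sampling fraction, not the real stream's rate

hits = simulate_status_sample(posting_counts, p)
for user, n in posting_counts.items():
    print(user, n, round(appearance_probability(n, p), 3), hits[user])
```

The per-status probability is identical for everyone, yet the heavy poster is nearly certain to appear while the light poster often is not. That is the sense in which the sample is a sampling of users weighted by posting rate rather than a uniform sample of users.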