I've been collecting tweets for about a week for a project (http://www.happn.in).

Some characteristics of my current dataset:
* Begins around April 10th, 2009
* Collected from users located near 26 US cities
* ~5,000,000 tweets
* Growing at ~800,000 per day
* ~900MB in MySQL
* ~375,000 users
* ~21,000 users in one sample city (Boston)

If you, kanny, or anyone else is interested in using them for research
or projects or anything else, let me know.
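For the incremental downloads kanny mentions below, the core bookkeeping is just tracking the highest tweet id you've seen and passing it back as `since_id` on the next API call, so each poll returns only new tweets. A minimal sketch of that logic (endpoint and auth details omitted; `merge_new_tweets` is a hypothetical helper, not part of the Twitter API):

```python
# Sketch: since_id bookkeeping for incremental tweet collection.
# Assumes each tweet is a dict with a monotonically increasing "id",
# as returned by Twitter's timeline methods.

def merge_new_tweets(store, new_tweets):
    """Append tweets not already in store, then return the highest id
    seen, which should be passed as since_id on the next poll."""
    known = {t["id"] for t in store}
    for t in sorted(new_tweets, key=lambda t: t["id"]):
        if t["id"] not in known:
            store.append(t)
            known.add(t["id"])
    return max(known) if known else 0
```

In practice you would call the friends_timeline method in a loop, feed each page of results through something like this, and persist the returned id between runs so restarts pick up where they left off.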

Jay

On Apr 8, 11:26 am, kanny <fruhl...@coolgoose.com> wrote:
> I am interested in doing something deeper than surface-level
> processing of a user's incoming tweets. For this, I will need to
> create a corpus of the user's friends_timeline over, say, the past
> month or any computationally feasible period. Basically, a large
> enough set of, say, 1-100 million tweets for someone following
> 100-1000 people. It would be a one-time download; afterwards,
> incremental downloads should suffice.
>
> This would translate into 100MB-10GB of download per user. It could
> be less for people following fewer or less-active people. Does the
> Twitter API provide support for such corpus creation? It would be
> very helpful for Natural Language Processing research if Twitter
> created a sample corpus of the public_timeline or some selected
> users' timelines.
>
> Looking forward to some help in this regard.
> Thanks
