Re: [twitter-dev] Academic data release

Pete Warden Wed, 24 Feb 2010 12:43:25 -0800

The value lies in the particular properties of a real social graph, as
opposed to an artificially generated one. The sort of questions it's useful
for are primarily social rather than mathematical. For a summary of some
existing research on similar data sets, see:


http://petewarden.typepad.com/searchbrowser/2010/02/social-network-data-and-research.html

On Wed, Feb 24, 2010 at 11:18 AM, M. Edward (Ed) Borasky
<zn...@cesmail.net> wrote:

> Quoting Pete Warden <p...@petewarden.com>:
>
>> I'm looking into releasing a data set based on information pulled from the
>> Twitter API. It would be a free release limited to academic researchers, an
>> anonymized version of the network connections of several million users with
>> public profiles.
>>
>> What I'm hoping to release is something like this:
>> <user id>, <city-level location>, <follower ids>, <friend ids>
>>
>> In all cases, the ids are arbitrary identifiers that are not convertible to
>> actual Twitter ids, and any detailed locations are converted to the nearest
>> large city.
>>
>> I'm aware that it may be possible to de-anonymize some of these users based
>> on topology, but since much richer information is available through the API
>> on these users anyway, that seems unlikely to be an issue? However I'm
>> obviously keen to hear any concerns that Twitter (or other developers here)
>> may have before I go forward with this.
>>
>
> What is the value of such a dataset to an "academic researcher"? I consider
> myself an academic researcher, though I don't have a formal position as one.
> What can you do with a "real" Twitter "social graph" that you can't do with
> one generated by random techniques based on statistical sampling of Twitter
> data?
>
> A million-user "real" social graph, even assuming fewer than 5,000
> friend_ids and follower_ids per user, costs two million API calls. At 350
> calls per hour, that works out to 238 days by my calculation. And during
> that 238 days, the social graph is changing many times a second. A
> randomly-generated graph of a much larger size could be constructed in a
> day, *including* coding time, *and* you could incorporate the changing
> nature of Twitter social graphs in a simulation.
>
> (Smiling at the subtle irony in my standard email signature) ;-)
>
> --
> M. Edward (Ed) Borasky
> borasky-research.net/m-edward-ed-borasky/
>
> "A mathematician is a device for turning coffee into theorems." ~ Paul Erdos
>
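
For anyone who wants to check the crawl-time estimate quoted above, here is a
minimal sketch of the arithmetic. It assumes, as the quoted message does, one
friend_ids call and one follower_ids call per user (fewer than 5,000 ids each)
and a limit of 350 calls per hour. The function name and its parameters are
made up for illustration and are not part of any Twitter library.

```python
def crawl_days(users, calls_per_user=2, calls_per_hour=350):
    """Estimate the days needed to fetch friend and follower ids for `users`.

    Assumes one friend_ids call and one follower_ids call per user, and a
    fixed hourly rate limit (both figures come from the quoted message).
    """
    total_calls = users * calls_per_user   # 2,000,000 calls for 1M users
    hours = total_calls / calls_per_hour   # about 5,714 hours
    return hours / 24                      # convert hours to days


if __name__ == "__main__":
    print(f"{crawl_days(1_000_000):.0f} days")  # prints "238 days"
```

The same function shows how the estimate scales: at the same rate limit,
several million users would take proportionally longer.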
