[twitter-dev] Academic data release

2010-02-24 Thread Pete Warden
I'm looking into releasing a data set based on information pulled from the
Twitter API. It would be a free release limited to academic researchers, an
anonymized version of the network connections of several million users with
public profiles.

What I'm hoping to release is something like this:
user id, city-level location, follower ids, friend ids

In all cases, the ids are arbitrary identifiers that are not convertible to
actual Twitter ids, and any detailed locations are converted to the nearest
large city.
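
To make that concrete, here's a minimal sketch in PHP of the kind of mapping I have in mind (illustrative only, not the actual export script):

<?php
// Illustrative sketch only: each real Twitter id is replaced with an
// arbitrary sequential id, and the mapping table is discarded after the
// export so the released data alone can't be mapped back.
$anonIds = array();
$nextId = 1;

function anonymize($twitterId) {
    global $anonIds, $nextId;
    if (!isset($anonIds[$twitterId])) {
        $anonIds[$twitterId] = $nextId++;
    }
    return $anonIds[$twitterId];
}

// Example row: anonymized user id, city-level location, anonymized friend ids
$row = array(
    'user' => anonymize(12345),
    'location' => 'Boulder, CO', // detailed location already snapped to a large city
    'friends' => array_map('anonymize', array(678, 910)),
);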

I'm aware that it may be possible to de-anonymize some of these users based
on topology, but since much richer information is available through the API
on these users anyway, that seems unlikely to be an issue? However I'm
obviously keen to hear any concerns that Twitter (or other developers here)
may have before I go forward with this.

cheers,
Pete


Re: [twitter-dev] Academic data release

2010-02-24 Thread Pete Warden
The value lies in the particular properties of a real social graph, as
opposed to an artificially generated one. The sorts of questions it's useful
for are primarily social rather than mathematical. For a summary of some
existing research on similar data sets, see:

http://petewarden.typepad.com/searchbrowser/2010/02/social-network-data-and-research.html

On Wed, Feb 24, 2010 at 11:18 AM, M. Edward (Ed) Borasky
zn...@cesmail.net wrote:

 Quoting Pete Warden p...@petewarden.com:

 I'm looking into releasing a data set based on information pulled from the
 Twitter API. It would be a free release limited to academic researchers, an
 anonymized version of the network connections of several million users with
 public profiles.

 What I'm hoping to release is something like this:
 user id, city-level location, follower ids, friend ids

 In all cases, the ids are arbitrary identifiers that are not convertible to
 actual Twitter ids, and any detailed locations are converted to the nearest
 large city.

 I'm aware that it may be possible to de-anonymize some of these users based
 on topology, but since much richer information is available through the API
 on these users anyway, that seems unlikely to be an issue? However I'm
 obviously keen to hear any concerns that Twitter (or other developers here)
 may have before I go forward with this.


 What is the value of such a dataset to an academic researcher? I consider
 myself an academic researcher, though I don't have a formal position as one.
 What can you do with a real Twitter social graph that you can't do with
 one generated by random techniques based on statistical sampling of Twitter
 data?

 A million-user real social graph, even assuming fewer than 5,000
 friend_ids and follower_ids per user, costs two million API calls. At 350
 calls per hour, that works out to 238 days by my calculation. And during
 that 238 days, the social graph is changing many times a second. A
 randomly-generated graph of a much larger size could be constructed in a
 day, *including* coding time, *and* you could incorporate the changing
 nature of Twitter social graphs in a simulation.
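
 (Spelling that arithmetic out as a quick PHP sketch:)

 <?php
 // Ed's estimate: ~2 calls per user (friends/ids + followers/ids, one
 // page each if every user has fewer than 5,000 of both).
 $calls = 1000000 * 2;
 $perHour = 350;                  // the rate limit quoted above
 $days = $calls / $perHour / 24;  // ~5714 hours / 24
 printf("%.0f days\n", $days);    // prints "238 days"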

 (Smiling at the subtle irony in my standard email signature) ;-)

 --
 M. Edward (Ed) Borasky
 borasky-research.net/m-edward-ed-borasky/

 A mathematician is a device for turning coffee into theorems. ~ Paul
 Erdos



[twitter-dev] Re: graph API call returning 'This Method requires authentication'

2009-08-29 Thread Pete Warden
A late follow-up on this, but I'm hitting the same problem:

- It's only happening with friends/ids.json; all other calls work
- Bizarrely, I can call it fine from command-line curl on the same machine,
but using curl within PHP I get the error
- I've tried rejigging my curl/PHP code to use the in-URL syntax (e.g.
someone:passw...@twitter... as the URL) rather than curl_setopt, with no
luck

The documentation seems to indicate this method doesn't even require
authentication, so I'm left scratching my head. If I can create a minimal
reproducible case, I'll file an issue, but for now I just want to document
this for posterity.
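
In case it helps anyone searching later, here's a stripped-down sketch of the
curl_setopt variant that fails for me (the credentials and user_id are
placeholders):

<?php
// Minimal sketch of the failing pattern; other API methods work fine
// with exactly this same code.
$ch = curl_init('http://twitter.com/friends/ids.json?user_id=12345');
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_USERPWD, 'someuser:somepass');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);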

Pete

On Sun, Jun 7, 2009 at 7:56 AM, developerinlondon ebilliona...@gmail.com wrote:


 Hi,

 For some reason I am getting the above error message returned for a
 specific ID only:

 curl -u username:password
 http://twitter.com/friends/ids.xml?user_id=2064571515
 <?xml version="1.0" encoding="UTF-8"?>
 <hash>
   <request>/friends/ids.xml?user_id=2064571515</request>
   <error>This method requires authentication.</error>
 </hash>

 With any other Twitter ID it works fine and I get a list of IDs back.
 E.g. the following works:
 curl -u username:password
 http://twitter.com/friends/ids.xml?user_id=23943320

 Any ideas what I need to do to fix this?
 The pattern seems to always occur when I try a user ID with 10
 digits.

 Thanks,

 Nayeem



[twitter-dev] Patch for permanent portrait URLs with SPIURL

2009-03-11 Thread Pete Warden
I've been using Shannon Whitley's SPIURL code for the last couple of weeks,
and it's a very simple way to create permanent links to users' portraits. In
the last few days, though, I started to notice it returning a lot of 404
errors for images. Digging into it, I found I was hitting the 100
request-per-hour limit. Looking through the documentation this makes sense;
what has me confused is how it was working before! Was the limit not actually
being enforced for anonymous access to
http://twitter.com/users/show/screen_name.xml until recently?

I know there's some other developers using this, so here's my patch to
authenticate those API calls:
http://petewarden.typepad.com/searchbrowser/2009/03/adding-authentication-to-the-spiurl-permanent-twitter-portrait-project.html
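
The core of the change is just adding basic auth to the users/show fetch so
it counts against the authenticated rate limit instead of the anonymous cap;
a rough sketch of the idea (not the actual patch, names are placeholders):

<?php
// Sketch only, not the real SPIURL code: authenticate the portrait
// lookup so it isn't subject to the anonymous 100-requests-per-hour limit.
$screenName = 'someuser'; // placeholder
$ch = curl_init('http://twitter.com/users/show/' . $screenName . '.xml');
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_USERPWD, 'apiuser:apipass'); // placeholder credentials
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml = curl_exec($ch);
curl_close($ch);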

Pete Warden


[twitter-dev] Re: one-click follow

2009-02-26 Thread Pete Warden
Well, actually there kinda is:
http://petewarden.typepad.com/searchbrowser/2008/12/how-to-create-a-oneclick-twitter-follow-button.html

There was a hole in December that allowed the user's twitter.com
authentication cookies to be used by another page's Javascript. That's
since been fixed, so the technique now brings up a dialog asking for the
user's name and password when they click. A pretty confusing user
experience though, so I no longer use it.

From a UI point of view I'd prefer a dedicated Twitter landing page you
could send people to that just contained a 'Do you want to follow X?'
prompt, rather than the ubiquitous 'Go to this page and then find the
follow button' text on every source page. Just my 2 cents though. :)

Pete

On Thu, Feb 26, 2009 at 1:46 PM, Stuart stut...@gmail.com wrote:

 2009/2/26 pnoeric e...@ericmueller.org


 Hey, is there a one-click "follow this user" link? I'm adding social
 bookmarking features to my site and one of them is "Follow us on
 Twitter". Currently I send them to my Twitter page (http://twitter.com/
 flwbooks) and they have to click below my icon, then click "following".

 I'd prefer to have them land on a page that just said "Ok, you're now
 following @FLWbooks" (or even a simple "Do you want to follow
 @FLWbooks? yes/no" page).

 It sounds minor, but every click counts, so thought I'd ask... :-)


 No there isn't, since it would be wide open to abuse. And a yes/no page
 wouldn't reduce clicks, so it's rather redundant IMHO.
 -Stuart

 --
 http://stut.net/



Re: Social graph (was Re: [twitter-dev] Does this exist?)

2009-02-26 Thread Pete Warden
(Privately mailed, since I'm nervous about edging off-topic)

I'm working on some related areas, capturing conversation data from Twitter
at http://twitter.mailana.com/ . My approach has been the classic disk-space
trade-off, creating massive indices to pre-cache queries. You're right
though, even with that approach the overhead of updating the denormalized
data of a complete friends-of-friends list for all users every time a link
changed would be enormous.
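
For a flavor of what the pre-caching looks like, here's a sketch (the table
and column names are made up for illustration, not from my actual schema):

<?php
// Disk-space trade-off: precompute friends-of-friends into a flat,
// indexed table at crawl time so reads become one index lookup rather
// than a two-hop join at query time.
$pdo = new PDO('mysql:host=localhost;dbname=twitter', 'user', 'pass');

// Assumed schema: friends(user_id, friend_id)
//                 fof(user_id, fof_id, INDEX(user_id))
$pdo->exec(
    'INSERT INTO fof (user_id, fof_id)
     SELECT a.user_id, b.friend_id
     FROM friends a JOIN friends b ON a.friend_id = b.user_id'
);

// At query time, a single indexed scan:
$stmt = $pdo->prepare('SELECT fof_id FROM fof WHERE user_id = ?');
$stmt->execute(array(12345));
$fof = $stmt->fetchAll(PDO::FETCH_COLUMN);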

Pete

On Thu, Feb 26, 2009 at 4:39 PM, Nick Arnett nick.arn...@gmail.com wrote:



 On Thu, Feb 26, 2009 at 4:19 PM, Nick Arnett nick.arn...@gmail.com wrote:


 A relational database falls down very fast on this kind of analysis.  For
 example, I have more than 300 followers, which is a simple query... but it
 returns 300 users and now the query needs to ask who the followers of those
 300 are, to answer question No. 1.  That's a big, slow query, since it has
 to specify the ids of the 300 that I follow... or it is 300 smaller queries.
  Either way, ugh.  That query is going to return a very large number of
 items, many of which need to be compared with one another.


 FYI, there are 345,000 nodes and 1.4 million edges in the graph of me, my
 followers and their followers.  I'm sure this could be pared down
 considerably by eliminating a handful of extremely popular people, but it's
 still a hard problem to scale.

 Nick



Re: Frequent HTTP 500 server errors from search API

2009-02-18 Thread Pete Warden

I'm hitting the same 500 problem with complex queries. It's occasional
enough that I can live with it, but the big issue for me is that I'm
using the JSON interface and appear to get back HTML in the error
content. This doesn't allow me to do any error handling, as the
<script> tag tries to interpret the content as Javascript and
obviously fails before I can do anything to catch the problem.
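
One possible workaround (sketched here with assumed endpoint and field
names) is to proxy the search call server-side, so the page always receives
well-formed JSON even when the API returns an HTML error page:

<?php
// Server-side proxy sketch: if the search response isn't valid JSON
// (e.g. an HTML 500 page), substitute a well-formed JSON error that
// the calling page can actually handle.
$q = isset($_GET['q']) ? $_GET['q'] : '';
$body = @file_get_contents('http://search.twitter.com/search.json?q=' . urlencode($q));
header('Content-Type: application/json');
if ($body === false || json_decode($body) === null) {
    echo json_encode(array('error' => 'search returned a non-JSON response'));
} else {
    echo $body;
}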

cheers,
Pete

On Feb 18, 8:14 am, Matt Sanford m...@twitter.com wrote:
 Hi Nigel,

      The HTTP 500 and HTTP 502 are different errors. A 502 is returned  
 when there are no more mongrel processes available to handle your  
 request, the dreaded Fail Whale. We've been adding more hardware to  
 our search system to keep from having problems, but we ran into a bug
 yesterday that threw tons of 502s.

      The 500 is a different matter. From the queries you described I'm
 guessing these are query timeouts. We've optimized the system for the
 most common use, which is 1-2 real words. Some queries take too long
 and we kill them rather than let them back the entire system up
 (causing 502s). When a query gets killed we return a 500. Since we've
 had some time to cache part of the information, retries work more
 often, but since there are multiple machines that's not always the
 case. If you can provide me some real example queries that you're
 having trouble with (m...@twitter.com) I can confirm this, but it seems
 like the most likely reason. The most commonly killed queries are for
 complex combinations of operators, or queries with multiple words that
 rarely ever appear together.

      As for the firehose, we'll be updating the FAQ page
 (http://apiwiki.twitter.com/FAQ#Whenwillthefirehosebeready) as things
 change. We've been working on this in parallel with OAuth and are
 working on the final touches now.

 Thanks;
    — Matt Sanford

 On Feb 18, 2009, at 06:08 AM, nigel_spl...@yahoo.com wrote:



  Hello,

  We get frequent and seemingly-random HTTP 500 errors when calling the
  search API.  Sometimes it's 500, sometimes 502.  It's definitely not
  throttling, as we don't make that many calls and I know from reading
  here that those errors have a specific message.

  I've tried without success to find particular searches, times of day,
  etc. that are more or less likely to fail.  Eventually I wrote a
  little program that generates random three-letter nonsense words and
  calls the search API with wget for each one.  Out of 100 different
  search terms, usually there are 3-10 failures.  If I try the same
  search terms again, sometimes the ones that failed will now work, and
  vice versa.  This behavior is similar to what we see in our production
  code, which is completely different but also fails randomly and fairly
  frequently.

  Any ideas?  We'd love to get a solution, since even with retry
  policies the errors are frequent enough to really hinder our ability
  to get search results.

  On a related note, is there any news on availability of the content
  firehose?  Last I remember reading, it was going to be made available
  to trusted partners at the end of January or beginning of February.
  Our ideal solution would be to have access to the firehose so we can
  index and search the content on our side.

  Thanks!
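
(For reference, a harness along the lines Nigel describes is only a few
lines of PHP; this is a reconstruction from his description, not his actual
code:)

<?php
// Generate 100 random three-letter nonsense words and count how many
// search API calls come back with a non-200 status.
$failures = 0;
for ($i = 0; $i < 100; $i++) {
    $word = '';
    for ($j = 0; $j < 3; $j++) {
        $word .= chr(rand(ord('a'), ord('z')));
    }
    $headers = @get_headers('http://search.twitter.com/search.json?q=' . $word);
    if ($headers === false || strpos($headers[0], '200') === false) {
        $failures++;
    }
}
echo "$failures failures out of 100\n";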


Re: Sample code for using Yahoo's geocoding to emulate 'near' in the search API

2009-02-18 Thread Pete Warden
Thanks Chad, mostly just wanted to get something into the search indexes for
anyone else looking. :) I'd never thought of using Pipes; that's a very neat
approach, and it feels more elegant than requiring PHP in a lot of situations.

cheers,
 Pete

On Wed, Feb 18, 2009 at 1:30 PM, Chad Etzel jazzyc...@gmail.com wrote:


 argh, hit send by mistake... I was going to add:

 Your sample looks great, and I may even start using it for some other
 projects where the pipe would not be as useful.  Thanks for posting
 the link, very nice.

 I wasn't trying to trump your example, merely posting another way to
 get around the lack of near:/within: syntax on the API side.

 -Chad

 On Wed, Feb 18, 2009 at 4:28 PM, Chad Etzel jazzyc...@gmail.com wrote:
  I use this Y! Pipe for TweetGrid to accomplish geocoding:
 
 
 http://pipes.yahoo.com/pipes/pipe.info?_id=27c113188a1f89baab07f2d133bc3557
 
  it was lovingly copied and edited from a similar pipe by @JohnDBishop
  (with permission).
 
  I use this with a JSON callback (plus some regex matching) to
  translate between near:/within: syntax and geocoding.
 
  Anyone is welcome to clone/edit it for their own use.
 
  -Chad
 
  On Wed, Feb 18, 2009 at 3:17 PM, Pete Warden searchbrow...@gmail.com
 wrote:
 
  I needed a way for users to be able to enter readable place names and
  do searches restricted to the neighborhood. The search API only
  supports lat,long so I had to implement some geocoding to translate
  names into coordinates. I ended up using Yahoo's free GeoPlanet
  service, which allows up to 50,000 requests per month.
 
  Since I couldn't find any other public examples of how to do this
  (though I'm sure this must be in a lot of code out there) I put up my
  sample code:
 
 http://petewarden.typepad.com/searchbrowser/2009/02/how-to-emulate-near-in-the-twitter-search-api-using-geoplanet.html
 
  It's a small PHP file, and works just like the normal search API call
  but with an additional 'near' argument that gets translated by the
  geocoding. I'd love to see some more explanation on the docs wiki of
  this sort of workaround for 'near', but it seems that it's only
  editable by Twitter employees? Facebook's more open editing policy
  seems to work well for them.
 
  cheers,
Pete
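
The shape of the workaround, as a sketch (the GeoPlanet request and response
fields here are assumptions; see the linked post for the working version):

<?php
// Emulating 'near': geocode the place name with GeoPlanet, then pass
// the resulting lat,long to the search API's geocode parameter.
$place = urlencode('Boulder, CO');
$appid = 'YOUR_YAHOO_APP_ID'; // placeholder
$geo = json_decode(file_get_contents(
    "http://where.yahooapis.com/v1/places.q('$place')?format=json&appid=$appid"
), true);
$centroid = $geo['places']['place'][0]['centroid'];
$lat = $centroid['latitude'];
$lon = $centroid['longitude'];

// Search within a 25-mile radius of the geocoded point:
$results = file_get_contents(
    "http://search.twitter.com/search.json?q=coffee&geocode=$lat,$lon,25mi"
);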