[twitter-dev] Academic data release
I'm looking into releasing a data set based on information pulled from the Twitter API. It would be a free release limited to academic researchers: an anonymized version of the network connections of several million users with public profiles.

What I'm hoping to release is something like this: user id, city-level location, follower ids, friend ids.

In all cases, the ids are arbitrary identifiers that are not convertible to actual Twitter ids, and any detailed locations are converted to the nearest large city. I'm aware that it may be possible to de-anonymize some of these users based on topology, but since much richer information is available through the API on these users anyway, that seems unlikely to be an issue? However, I'm obviously keen to hear any concerns that Twitter (or other developers here) may have before I go forward with this.

cheers,
Pete
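[Editor's sketch: the post doesn't say how the arbitrary identifiers are produced, so the keyed-hash scheme, the secret key name, and the city table below are all assumptions, shown only to illustrate the kind of mapping described.]

```python
import hashlib
import hmac

# Hypothetical anonymization sketch: real Twitter ids become stable but
# non-reversible identifiers (via a keyed hash), and detailed locations
# are coarsened to a large-city label. Key and table are placeholders.
SECRET_KEY = b"replace-with-a-random-secret"

def anonymize_id(twitter_id: int) -> str:
    """Map a real id to an arbitrary identifier; same input, same output."""
    digest = hmac.new(SECRET_KEY, str(twitter_id).encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

def city_level(location: str, city_table: dict) -> str:
    """Coarsen a detailed location to its nearest large city."""
    return city_table.get(location, "unknown")

record = {
    "user_id": anonymize_id(12345),
    "location": city_level("Boulder, CO", {"Boulder, CO": "Denver"}),
    "follower_ids": [anonymize_id(i) for i in (111, 222)],
}
```

Without knowing the key, the hash can't be reversed to a Twitter id, though (as the post concedes) graph topology can still leak identity.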
Re: [twitter-dev] Academic data release
The value lies in the particular properties of a real social graph, as opposed to an artificially generated one. The sort of questions it's useful for are primarily social rather than mathematical. For a summary of some existing research on similar data sets, see:
http://petewarden.typepad.com/searchbrowser/2010/02/social-network-data-and-research.html

On Wed, Feb 24, 2010 at 11:18 AM, M. Edward (Ed) Borasky zn...@cesmail.net wrote:

Quoting Pete Warden p...@petewarden.com:

I'm looking into releasing a data set based on information pulled from the Twitter API. [original announcement quoted in full above]

What is the value of such a dataset to an academic researcher? I consider myself an academic researcher, though I don't have a formal position as one. What can you do with a real Twitter social graph that you can't do with one generated by random techniques based on statistical sampling of Twitter data?

A million-user real social graph, even assuming fewer than 5,000 friend_ids and follower_ids per user, costs two million API calls. At 350 calls per hour, that works out to 238 days by my calculation. And during those 238 days, the social graph is changing many times a second. A randomly-generated graph of a much larger size could be constructed in a day, *including* coding time, *and* you could incorporate the changing nature of Twitter social graphs in a simulation.

(Smiling at the subtle irony in my standard email signature) ;-)

--
M. Edward (Ed) Borasky
borasky-research.net/m-edward-ed-borasky/

A mathematician is a device for turning coffee into theorems. ~ Paul Erdos
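[Editor's note: Ed's 238-day figure checks out. The arithmetic, spelled out, assuming one friends/ids call plus one followers/ids call per user (each returning up to 5,000 ids per request):]

```python
# Rate-limit arithmetic behind the "238 days" estimate above.
users = 1_000_000
calls_per_user = 2        # one friends/ids + one followers/ids page
calls_per_hour = 350      # the rate limit quoted in the thread

total_calls = users * calls_per_user   # 2,000,000
hours = total_calls / calls_per_hour   # ~5,714 hours
days = hours / 24                      # ~238 days
print(round(days))                     # → 238
```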
[twitter-dev] Re: graph API call returning 'This Method requires authentication'
A late follow-up on this, but I'm hitting the same problem:

- It's only happening with friends/ids.json; all other calls work.
- Bizarrely, I can call it fine from command-line curl on the same machine, but using curl within PHP I get the error.
- I've tried rejigging my curl/PHP code to use the in-URL syntax (eg someone:passw...@twitter... as the URL) rather than curl_setopt, with no luck.

The documentation seems to indicate this method doesn't even require authentication, so I'm left scratching my head. If I can create a minimal reproducible case, I'll file an issue, but for now I just want to document this for posterity.

Pete

On Sun, Jun 7, 2009 at 7:56 AM, developerinlondon ebilliona...@gmail.com wrote:

Hi,

For some reason I am getting the above error message returned for a specific ID only:

curl -u username:password http://twitter.com/friends/ids.xml?user_id=2064571515

<?xml version="1.0" encoding="UTF-8"?>
<hash>
  <request>/friends/ids.xml?user_id=2064571515</request>
  <error>This method requires authentication.</error>
</hash>

For any other Twitter IDs it works fine and I get a list of IDs back. Eg the following works:

curl -u username:password http://twitter.com/friends/ids.xml?user_id=23943320

Any ideas what I need to be doing to fix this? The pattern seems to always occur when I try a user ID with 10 digits.

Thanks,
Nayeem
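[Editor's sketch: when command-line curl works but a library doesn't, one thing worth ruling out is whether the Authorization header is actually being sent. Below is a minimal illustration of building the HTTP Basic auth header by hand, exactly as `curl -u user:pass` does; the credentials and URL are placeholders, and the endpoint itself is long gone, so the actual request is left commented out.]

```python
import base64
import urllib.request

def basic_auth_header(username: str, password: str) -> str:
    """Construct the HTTP Basic auth header, as `curl -u` would send it."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

# Attaching the header explicitly, rather than relying on the library's
# own credential handling, rules out it silently not being sent.
req = urllib.request.Request(
    "http://twitter.com/friends/ids.json?user_id=23943320",
    headers={"Authorization": basic_auth_header("username", "password")},
)
# urllib.request.urlopen(req) would perform the call.
```

In PHP's curl, the equivalent check is setting the header directly via CURLOPT_HTTPHEADER instead of CURLOPT_USERPWD.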
[twitter-dev] Patch for permanent portrait URLs with SPIURL
I've been using Shannon Whitley's SPIURL code for the last couple of weeks, and it's a very simple way to create permanent links to users' portraits. In the last few days, though, I started to notice it returning a lot of 404 errors for images. Digging into it, I was hitting the 100 requests-per-hour limit. Looking through the documentation this makes sense; what has me confused is how it was working before! Was the limit not actually being enforced for anonymous access to http://twitter.com/users/show/screen_name.xml until recently?

I know there are some other developers using this, so here's my patch to authenticate those API calls:
http://petewarden.typepad.com/searchbrowser/2009/03/adding-authentication-to-the-spiurl-permanent-twitter-portrait-project.html

Pete Warden
[twitter-dev] Re: one-click follow
Well, actually there kinda is:
http://petewarden.typepad.com/searchbrowser/2008/12/how-to-create-a-oneclick-twitter-follow-button.html

There was a hole in December that allowed the user's twitter.com authentication cookies to be used by another page's Javascript. That's now been fixed, so the technique now brings up a dialog asking for the user's name and password when they click. A pretty confusing user experience though, so I no longer use it.

From a UI point of view I'd prefer to have a dedicated Twitter landing page that you could send people to that just contained a 'Do you want to follow X?', rather than having the ubiquitous 'Go to this page and then find the follow button' text on every source page. Just my 2 cents though. :)

Pete

On Thu, Feb 26, 2009 at 1:46 PM, Stuart stut...@gmail.com wrote:

2009/2/26 pnoeric e...@ericmueller.org

Hey, is there a one-click 'follow this user' link? I'm adding social bookmarking features to my site and one of them is 'Follow us on Twitter'. Currently I send them to my Twitter page (http://twitter.com/flwbooks) and they have to click below my icon, then click 'following'. I'd prefer to have them land on a page that just said 'Ok, you're now following @FLWbooks' (or even a simple 'Do you want to follow @FLWbooks? yes/no' page). It sounds minor, but every click counts, so thought I'd ask... :-)

No there isn't, since it would be wide open to abuse. And a yes/no page would not reduce clicks, so it is rather redundant IMHO.

-Stuart
--
http://stut.net/
Re: Social graph (was Re: [twitter-dev] Does this exist?)
(Privately mailed, since I'm nervous about edging off-topic)

I'm working on some related areas, capturing conversation data from Twitter at http://twitter.mailana.com/ . My approach has been the classic disk-space trade-off, creating massive indices to pre-cache queries. You're right though: even with that approach, the overhead of updating the denormalized data of a complete friends-of-friends list for all users every time a link changed would be enormous.

Pete

On Thu, Feb 26, 2009 at 4:39 PM, Nick Arnett nick.arn...@gmail.com wrote:

On Thu, Feb 26, 2009 at 4:19 PM, Nick Arnett nick.arn...@gmail.com wrote:

A relational database falls down very fast on this kind of analysis. For example, I have more than 300 followers, which is a simple query... but it returns 300 users, and now the query needs to ask who the followers of those 300 are, to answer question No. 1. That's a big, slow query, since it has to specify the ids of the 300 that I follow... or it is 300 smaller queries. Either way, ugh. That query is going to return a very large number of items, many of which need to be compared with one another.

FYI, there are 345,000 nodes and 1.4 million edges in the graph of me, my followers and their followers. I'm sure this could be pared down considerably by eliminating a handful of extremely popular people, but it's still a hard problem to scale.

Nick
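[Editor's sketch: the two-hop fan-out Nick describes can be shown with plain sets; the toy follower graph below is made up, but it illustrates why a 300-follower account balloons to hundreds of thousands of nodes one hop further out.]

```python
# Toy follower graph (made-up data): user -> set of follower ids.
followers = {
    "me": {"a", "b", "c"},
    "a": {"b", "d"},
    "b": {"d", "e"},
    "c": {"me", "e"},
    "d": set(),
    "e": {"a"},
}

def two_hop(user: str) -> set:
    """All followers and followers-of-followers, excluding the root:
    the graph Nick counts 345,000 nodes in for his own account."""
    first_hop = followers[user]
    nodes = set(first_hop)
    for f in first_hop:
        nodes |= followers.get(f, set())
    nodes.discard(user)
    return nodes

print(sorted(two_hop("me")))  # → ['a', 'b', 'c', 'd', 'e']
```

In a relational store this union is the "300 smaller queries" problem; in memory it's one pass, which is why denormalized or graph-shaped storage wins here.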
Re: Frequent HTTP 500 server errors from search API
I'm hitting the same 500 problem with complex queries. It's occasional enough that I can live with it, but the big issue for me is that I'm using the JSON interface and appear to get back HTML in the error content. This doesn't allow me to do any error handling, as the script tag tries to interpret the content as Javascript and obviously fails before I can do anything to catch the problem.

cheers,
Pete

On Feb 18, 8:14 am, Matt Sanford m...@twitter.com wrote:

Hi Nigel,

The HTTP 500 and HTTP 502 are different errors. A 502 is returned when there are no more mongrel processes available to handle your request: the dreaded Fail Whale. We've been adding more hardware to our search system to keep from having problems, but we ran into a bug yesterday that threw tons of 502s.

The 500 is a different matter. From the queries you described, I'm guessing these are query timeouts. We've optimized the system for the most common use, which is 1-2 real words. Some queries take too long and we kill them rather than let them back the entire system up (causing 502s). When a query gets killed we return a 500. Since we've had some time to cache part of the information, retries work more often, but since there are multiple machines that's not always the case. If you can provide me some real example queries that you're having trouble with (m...@twitter.com), I can confirm this, but it seems like the most likely reason. The most commonly killed queries are complex combinations of operators, or queries with multiple words that rarely ever appear together.

As for the firehose, we'll be updating the FAQ page (http://apiwiki.twitter.com/FAQ#Whenwillthefirehosebeready) as things change. We've been working on this in parallel with OAuth and are working on the final touches now.

Thanks;
— Matt Sanford

On Feb 18, 2009, at 06:08 AM, nigel_spl...@yahoo.com wrote:

Hello,

We get frequent and seemingly-random HTTP 500 errors when calling the search API. Sometimes it's 500, sometimes 502. It's definitely not throttling, as we don't make that many calls, and I know from reading here that those errors have a specific message. I've tried without success to find particular searches, times of day, etc. that are more or less likely to fail.

Eventually I wrote a little program that generates random three-letter nonsense words and calls the search API with wget for each one. Out of 100 different search terms, usually there are 3-10 failures. If I try the same search terms again, sometimes the ones that failed will now work, and vice versa. This behavior is similar to what we see in our production code, which is completely different but also fails randomly and fairly frequently.

Any ideas? We'd love to get a solution, since even with retry policies the errors are frequent enough to really hinder our ability to get search results.

On a related note, is there any news on availability of the content firehose? Last I remember reading, it was going to be made available to trusted partners at the end of January or beginning of February. Our ideal solution would be to have access to the firehose so we can index and search the content on our side.

Thanks!
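[Editor's sketch: the HTML-instead-of-JSON error body Pete describes is unrecoverable for a JSONP script tag, but when the response is fetched server-side or via XHR the body can be inspected first. A minimal illustration of that defensive check; the sample bodies below are made up.]

```python
import json

def parse_search_response(body: str):
    """Return parsed JSON, or None if the server sent back an HTML
    error page instead of JSON (as the 500s described above do)."""
    stripped = body.lstrip()
    if stripped.startswith("<"):  # HTML error page, not JSON
        return None
    try:
        return json.loads(stripped)
    except ValueError:
        return None

print(parse_search_response('{"results": []}'))   # → {'results': []}
print(parse_search_response("<html>500</html>"))  # → None
```

Checking the Content-Type header, where available, is an equally good first gate before parsing.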
Re: Sample code for using Yahoo's geocoding to emulate near in the search API
Thanks Chad, mostly just wanted to get something into the search indexes for anyone else looking. :) I'd never thought of using Pipes; that is a very neat approach, and feels more elegant than requiring PHP in a lot of situations.

cheers,
Pete

On Wed, Feb 18, 2009 at 1:30 PM, Chad Etzel jazzyc...@gmail.com wrote:

argh, hit send by mistake.. I was going to add: Your sample looks great, and I may even start using it for some other projects where the pipe would not be as useful. Thanks for posting the link, very nice. I wasn't trying to trump your example, merely posting another way to get around the near:/within: syntax not being available on the API side.

-Chad

On Wed, Feb 18, 2009 at 4:28 PM, Chad Etzel jazzyc...@gmail.com wrote:

I use this Y! Pipe for TweetGrid to accomplish geocoding:
http://pipes.yahoo.com/pipes/pipe.info?_id=27c113188a1f89baab07f2d133bc3557

It was lovingly copied and edited from a similar pipe by @JohnDBishop (with permission). I use this with a JSON callback (plus some regex matching) to translate between near:/within: syntax and geocoding. Anyone is welcome to clone/edit it for their own use.

-Chad

On Wed, Feb 18, 2009 at 3:17 PM, Pete Warden searchbrow...@gmail.com wrote:

I needed a way for users to be able to enter readable place names and do searches restricted to the neighborhood. The search API only supports lat,long, so I had to implement some geocoding to translate names into coordinates. I ended up using Yahoo's free GeoPlanet service, with 50,000 requests possible per month. Since I couldn't find any other public examples of how to do this (though I'm sure this must be in a lot of code out there), I put up my sample code:
http://petewarden.typepad.com/searchbrowser/2009/02/how-to-emulate-near-in-the-twitter-search-api-using-geoplanet.html

It's a small PHP file, and works just like the normal search API call but with an additional near argument that gets translated by the geocoding.

I'd love to see some more explanation on the docs wiki of this sort of workaround for 'near', but it seems that it's only editable by Twitter employees? Facebook's more open editing policy seems to work well for them.

cheers,
Pete
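[Editor's sketch: the shape of the workaround Pete describes. The GeoPlanet call is replaced here by a made-up lookup table, and the radius default is an assumption; the point is only the translation of a readable place name into the search API's geocode=lat,long,radius parameter.]

```python
from urllib.parse import urlencode

# Toy stand-in for the geocoding service: place name -> (lat, long).
# In the real code this would be a GeoPlanet web-service call.
GEOCODE_TABLE = {
    "boulder": (40.015, -105.271),
    "london": (51.507, -0.128),
}

def build_search_url(query: str, near: str, radius_miles: int = 25) -> str:
    """Emulate a `near:` argument by translating the place name into
    the lat,long,radius form the search API actually accepts."""
    lat, lon = GEOCODE_TABLE[near.lower()]
    params = {"q": query, "geocode": f"{lat},{lon},{radius_miles}mi"}
    return "http://search.twitter.com/search.json?" + urlencode(params)

print(build_search_url("coffee", "Boulder"))
```

The caller sees a friendly `near` argument; the API only ever sees coordinates, which is exactly the split the PHP sample implements.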