[twitter-dev] Re: Possible Bug in Twitter Search API

2009-05-15 Thread Matt Sanford


Hi Brian,

My guess is that this is the same since_id/max_id pagination
confusion we have always had. If you look at the next_page URL in our
API you'll notice that it does not contain the since_id. If you are
searching with since_id and requesting multiple pages, you need to
stop paginating manually once you find an id lower than your original
since_id. I know this is a pain, but it brings a large performance gain
on our back end. There was an update a few weeks ago [1] where I
talked about this, and a warning message (twitter:warning in Atom,
warning in JSON) was added to alert you that the since_id had been
removed. Does that sound like the cause of your issue?


Thanks;
 – Matt Sanford / @mzsanford
 Twitter Dev

[1] - 
http://groups.google.com/group/twitter-development-talk/browse_frm/thread/6e80cb6eec3a16d3?tvc=1
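In client code, the manual stop Matt describes might look something like this (a minimal sketch, not official Twitter sample code; `pages` stands in for successive decoded JSON result pages, newest tweets first):

```python
def filter_new(pages, since_id):
    """Walk Search API result pages in order (newest first), keeping only
    ids above since_id. Because next_page does not carry since_id forward,
    the caller must stop paginating at the first id at or below the
    original since_id."""
    kept = []
    for page in pages:
        for result in page:
            if result["id"] <= since_id:
                return kept  # reached already-seen tweets: stop paging here
            kept.append(result)
    return kept  # ran out of pages before hitting the watermark
```

Once `filter_new` returns early, no further pages should be requested for that poll cycle.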

On May 15, 2009, at 7:50 AM, briantroy wrote:



I've noticed this before but always treated it as a bug on
my side. It is now clear to me, however, that from time to time the
Twitter Search API seems to ignore the since_id.

We track FollowFriday by polling Twitter Search every so often (the
process is throttled from 10 seconds to 180 seconds depending on how
many results we get). This works great 90% of the time. But on high-
volume days (Fridays) I've noticed we get a lot of multi-page
responses, causing us to make far too many requests to the Twitter API
(900/hour).
While trying to figure out why we are making so many requests I
uncovered something very interesting. When we get a tweet we store
it in our database, which has a unique index on the customer
id/tweet id pair. When we get multi-page responses from Twitter and
iterate through each page, the VAST MAJORITY of the tweets violate
this unique index. What does this mean? That we already have those
tweets.
Today I turned on some additional debugging and saw that the tweets
we were getting from Twitter Search were, in fact, prior to the
since_id we sent.
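The unique-index dedupe described above can be sketched roughly like this (illustrative schema using SQLite in place of the real MySQL table, whose full DDL isn't shown in the thread):

```python
import sqlite3

# In-memory stand-in for the real table; the unique index on
# (bulk_svc_id, tw_id) is what rejects tweets we already stored.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tweets (
        tw_id       INTEGER,
        bulk_svc_id INTEGER,
        text        TEXT,
        UNIQUE (bulk_svc_id, tw_id)
    )
""")

def store_tweet(customer_id, tweet_id, text):
    """Insert a tweet; return False if the unique index says we had it."""
    try:
        conn.execute(
            "INSERT INTO tweets (bulk_svc_id, tw_id, text) VALUES (?, ?, ?)",
            (customer_id, tweet_id, text))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate: this customer already has this tweet
```

A high rate of `False` returns across a multi-page response is exactly the symptom described: the pages are full of tweets from before the since_id.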

This is causing us to POUND the API servers unnecessarily. There is,
however, really nothing I can do about it on my end.
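The adaptive throttle mentioned above (polling fast when results are plentiful, backing off when they are sparse) could be implemented with something like this linear scaling; the actual formula used isn't shown in the thread:

```python
def poll_interval(result_count, max_results=100, fastest=10, slowest=180):
    """Scale the polling delay inversely with result volume: a full page
    polls again in 10 s, an empty page backs off to 180 s."""
    fraction = min(result_count, max_results) / max_results
    return round(slowest - fraction * (slowest - fastest))
```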

Here is a snip of the log showing the failed inserts and the IDs we are
working with. The last line shows both the old max id and the new
max id (after processing the tweets). As you can see, every tweet
violates the unique constraint (27 is the customer id). You can also
see that we've called the API for this one search 1016 times this
hour... which is WAY, WAY too much (about 16.9 times per minute):

NOTICE: 10:45:37 AM on Fri May 15th Tweet insert failed: Duplicate entry '27-1806522797' for key 2
SQL: insert into justsignal.tweets(text, tw_id, to_user_id, to_user, from_user_id, from_user, iso_language_code, profile_image_url, created_at, bulk_svc_id) values('#<b>followfriday</b> edubloggers @CoolCatTeacher @dwarlick @ewanmcintosh @willrich45 @larryferlazzo @suewaters', 1806522797, 0, '', 192010, 'WeAreTeachers', 'en', 'http://s3.amazonaws.com/twitter_production/profile_images/52716611/Picture_2_normal.png', 'Fri, 15 May 2009 14:41:51 +', 27)
NOTICE: 10:45:37 AM on Fri May 15th Tweet insert failed: Duplicate entry '27-1806522766' for key 2
SQL: insert into justsignal.tweets(text, tw_id, to_user_id, to_user, from_user_id, from_user, iso_language_code, profile_image_url, created_at, bulk_svc_id) values('thx for the #<b>followfriday</b> love, @brokesocialite &amp; @silveroaklimo.  Also thx to @diamondemory &amp; @bmichelle for the RTs of FF', 1806522766, 0, '', 1149953, 'lmdupont', 'en', 'http://s3.amazonaws.com/twitter_production/profile_images/188591402/lisaann_normal.jpg', 'Fri, 15 May 2009 14:41:51 +', 27)
NOTICE: 10:45:37 AM on Fri May 15th Tweet insert failed: Duplicate entry '27-1806522760' for key 2
SQL: insert into justsignal.tweets(text, tw_id, to_user_id, to_user, from_user_id, from_user, iso_language_code, profile_image_url, created_at, bulk_svc_id) values('Thx! RT @dpbkmb: #<b>followfriday</b> @ifeelgod @americandream09 @DailyHappenings @MrMilestone @emgtay @Nurul54 @mexiabill @naturallyknotty', 1806522760, 0, '', 1303322, 'borgellaj', 'en', 'http://s3.amazonaws.com/twitter_production/profile_images/58399480/img017_normal.jpg', 'Fri, 15 May 2009 14:41:51 +', 27)
NOTICE: 10:45:37 AM on Fri May 15th Tweet insert failed: Duplicate entry '27-1806522759' for key 2
SQL: insert into justsignal.tweets(text, tw_id, to_user_id, to_user, from_user_id, from_user, iso_language_code, profile_image_url, created_at, bulk_svc_id) values('Morning my tweets!!! <b>follow friday</b>! Dnt forget to RT me in need of followers LOL!', 1806522759, 0, '', 11790458, 'Dae_Marie', 'en', 'http://s3.amazonaws.com/twitter_production/profile_images/199283178/dae_bab_normal.jpg', 'Fri, 15 May 2009 14:41:50 +', 27)
NOTICE: 10:45:37 AM on Fri May 15th Tweet insert failed: Duplicate entry '27-1806522752' for key 2
SQL: insert into justsignal.tweets(text, tw_id, to_user_id, to_user, from_user_id, from_user, iso_language_code, profile_image_url,

[twitter-dev] Re: Possible Bug in Twitter Search API

2009-05-15 Thread briantroy

Matt - I'll verify that is the issue (I assume I should have new
results on page one AND page two - otherwise something else is
going on).

Brian

On May 15, 8:33 am, Matt Sanford m...@twitter.com wrote:

[twitter-dev] Re: Possible Bug in Twitter Search API

2009-05-15 Thread briantroy

Matt -

That took care of it... a minor change on my side with big resource
savings. Where was the original announcement made that this had
changed? (I'm wondering how I missed it.)

Thanks!

Brian

On May 15, 8:33 am, Matt Sanford m...@twitter.com wrote:

[twitter-dev] Re: Possible Bug in Twitter Search API

2009-05-15 Thread Matt Sanford


Hi Brian,

This has always been the case; the thread I linked to earlier is
where I made it more explicit. The behavior was always there, but it
wasn't documented properly. The documentation has been updated as
well, to help in the future.


Thanks;
 – Matt Sanford / @mzsanford
 Twitter Dev

On May 15, 2009, at 9:14 AM, briantroy wrote:


