I’ve merged aspects of my code with your OAuth class. I created a very rough draft, so I ignored the case where a user defines their own 'oauth_client', and only modified the OAuth1 middleware. The fork can be seen here: https://github.com/joshlk/scrapy/blob/oauth-multi-token-draft/scrapy/contrib/downloadermiddleware/httpoauth.py
Here are some notes:

The problem: you have a pool of tokens, but you are rate limited on a
per-token basis. You therefore want to use the pool's maximum throughput
by cycling through the tokens.

Method:
* The user supplies a list of tokens via 'oauth_token_list'. The user
also specifies 'REQUEST_WINDOW_SIZE_MINS', which is how long one has to
wait until a token can be reused once it has reached its rate limit.
* Two queues are created, the dead queue and the live queue. All tokens
are initially placed in the dead queue.
* In 'process_request' it:
1. Tries to obtain a token from the live queue (which at first is empty).
2. Checks how long until the first token in the dead queue can be used
again (at first this will be 0). If there is a reusable dead token it
uses that. If it can't obtain a token from either queue, it places the
request into a 'request_delayed' list, to be processed later.
* In 'process_response': it determines whether the token has died, using
'check_response'. If the token has died it is placed in the dead queue;
if not, it is placed in the live queue.
* The 'spider_idle' method is called when the spider is idle. This can
occur when no tokens are currently available but requests exist in the
'request_delayed' list. When this occurs the method sleeps until tokens
become available and then reschedules the requests (see the sketches
below).

Notes:
* There are three places a token can be: the dead queue, the live queue,
or attached to a request.
* Determining whether a token has 'died' is, I believe, domain specific
(is there a standard way in the OAuth spec?). Therefore the
'check_response' function can be defined by the user. The default is to
always assume the token died after one response. If
REQUEST_WINDOW_SIZE_MINS is also set to 0 (the default), then the token
pool will be continually recycled.
* 'check_response' also determines whether a request failed because of
the token. If it has failed, the request is rescheduled.
* Placing requests into a 'request_delayed' list and then invoking the
'spider_idle' method is quite convoluted. A much easier way would be to
simply delay the request for X minutes, until you know tokens will
become available. However, I couldn't determine a way to do this. Any
ideas? (A rough attempt is sketched below.)
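
To make the queue mechanics concrete, here is a rough sketch of the
idea. This is not the code from the fork above: 'TokenPool' and its
method names are invented for illustration, and the Scrapy wiring is
left out.

import time
from collections import deque


class TokenPool(object):

    def __init__(self, tokens, window_mins=0):
        self.window_secs = window_mins * 60
        self.live = deque()  # tokens believed to be usable right now
        # All tokens start in the dead queue with an expired timestamp,
        # so they are immediately reusable on the first request.
        self.dead = deque((token, 0) for token in tokens)

    def acquire(self):
        """Return a usable token, or None if all are rate limited."""
        # 1. Prefer a token from the live queue.
        if self.live:
            return self.live.popleft()
        # 2. Otherwise see whether the oldest dead token has sat out
        #    its rate-limit window and can be revived.
        if self.dead:
            token, died_at = self.dead[0]
            if time.time() - died_at >= self.window_secs:
                self.dead.popleft()
                return token
        return None  # caller must delay the request

    def release(self, token, died):
        """Called from process_response after check_response decides
        whether the token hit its rate limit."""
        if died:
            self.dead.append((token, time.time()))
        else:
            self.live.append(token)

    def seconds_until_available(self):
        """Seconds until the oldest dead token can be reused (0 if a
        live token exists, None if the pool is empty)."""
        if self.live:
            return 0
        if not self.dead:
            return None
        _, died_at = self.dead[0]
        return max(0.0, self.window_secs - (time.time() - died_at))

Because tokens die in order, they also become reusable in order, so
reviving a token only ever needs to peek at the head of the dead queue.
With the default check_response (always "died") and a window of 0,
acquire simply round-robins through the whole pool.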
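And a rough skeleton of how the idle handling could hand the delayed
requests back to the engine. Again hypothetical: the class, the setting
names and 'requests_delayed' are made up, only the idle/reschedule logic
is shown, and it assumes the TokenPool sketched above. In fact
reactor.callLater plus engine.crawl might itself be the answer to my
question about delaying requests:

from twisted.internet import reactor
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class OAuthTokenPoolMiddleware(object):

    def __init__(self, crawler, pool):
        self.crawler = crawler
        self.pool = pool               # a TokenPool as sketched above
        self.requests_delayed = []     # requests waiting for a token
        self.pending = 0               # requests handed to callLater
        crawler.signals.connect(self.spider_idle,
                                signal=signals.spider_idle)

    @classmethod
    def from_crawler(cls, crawler):
        tokens = crawler.settings.getlist('OAUTH_TOKEN_LIST')
        window = crawler.settings.getint('REQUEST_WINDOW_SIZE_MINS', 0)
        return cls(crawler, TokenPool(tokens, window))

    def spider_idle(self, spider):
        if self.requests_delayed:
            wait = self.pool.seconds_until_available() or 0
            for request in self.requests_delayed:
                # Hand the request back to the engine once a token
                # should be available again.
                reactor.callLater(wait, self._reschedule, request, spider)
            self.pending += len(self.requests_delayed)
            self.requests_delayed = []
        if self.pending:
            # spider_idle fires every time the engine goes idle, so
            # keep refusing to close while timers are outstanding.
            raise DontCloseSpider

    def _reschedule(self, request, spider):
        self.pending -= 1
        self.crawler.engine.crawl(request, spider)

The subtlety is that DontCloseSpider has to keep being raised until the
callLater timers have actually put the requests back, otherwise the
spider may close before they fire. I still suspect there is a cleaner
way to delay a single request, though.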
All comments will be very welcome.

Thanks,
Josh

On Thursday, August 20, 2015 at 9:38:06 AM UTC+1, Juan Riaza wrote:
>
> It needs some proper tests plus docs. Hopefully I'll get time to do so
> early next month. I would like to check that code to cycle through a
> pool of tokens; it seems like a proper use case.
>
> On Wednesday, August 19, 2015 at 5:03:43 PM UTC+2, Josh Levy-Kramer wrote:
>>
>> Wow, this looks great. I wish I had found this a month ago! What
>> barriers are there to getting this into the main Scrapy code base?
>>
>> Additionally, I would say a common case with OAuth is that the user
>> has access to a pool of tokens and cycles through them, because APIs
>> usually restrict the number of calls that can be made per token. I
>> have written code that deals with this use case (although not very
>> gracefully). Would you be interested in incorporating this logic into
>> the OAuth middleware?
>>
>> On Wednesday, August 19, 2015 at 3:23:37 PM UTC+1, Juan Riaza wrote:
>>>
>>> Hi Josh,
>>>
>>> Nice to hear about that middleware. I worked some time ago on a draft
>>> implementation here:
>>> https://github.com/juanriaza/scrapy/commits/oauth-draft
>>> It would be awesome to check your middleware and give this a final
>>> push.
>>>
>>> On Wednesday, August 19, 2015 at 3:42:43 PM UTC+2, Josh Levy-Kramer
>>> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I have written an OAuth middleware module so Scrapy can access
>>>> websites or APIs which require OAuth authentication. Parts of the
>>>> module are quite specific to the sites I have been working with
>>>> (namely Twitter). Would anyone be interested in such a module? My
>>>> knowledge of the OAuth protocol is rather limited and I would be
>>>> interested in generalising the module for other websites that use
>>>> the OAuth protocol. Does anyone have the expertise to do this?
