I have thought of a much simpler solution. If there are no tokens left 
when 'process_request' is executed, you can simply reschedule the request 
by returning it from the process_request method. Although this doesn't 
delay the request, it simplifies the logic a lot and removes the need for 
the spider_ideal method. The new code can be found at:

https://github.com/joshlk/scrapy/blob/a534521d40aebf81a55dd69477dabc0b221e3e96/scrapy/contrib/downloadermiddleware/httpoauth.py
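As a rough sketch of that idea (the class and attribute names here are illustrative, not the actual code in the branch): in a Scrapy downloader middleware, returning the request from process_request hands it straight back to the scheduler, so no delay bookkeeping is needed at that point.

```python
import collections


class OAuthTokenMiddleware:
    """Minimal sketch of the rescheduling idea. If no token is
    available when process_request runs, return the request itself:
    Scrapy treats a returned Request as "reschedule this" and stops
    processing it through the rest of the middleware chain."""

    def __init__(self, tokens):
        self.live_queue = collections.deque(tokens)

    def process_request(self, request, spider):
        if not self.live_queue:
            # No token free right now: returning the request
            # reschedules it for a later attempt.
            return request
        # Attach a token and let the request continue downloading.
        request.meta['oauth_token'] = self.live_queue.popleft()
        return None
```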

Here is my revised method (updated from the previous post):
* The user supplies a list of tokens via 'oauth_token_list'. The user also 
specifies 'REQUEST_WINDOW_SIZE_MINS', which is how long one has to wait 
until a token can be reused once it has reached its rate limit
* Two lists are created, the dead queue and the live queue. All tokens are 
initially placed in the dead queue
* In 'process_request' it: 1. Tries to obtain a token from the live queue 
(which is empty at first). 2. Checks how long the first token in the dead 
queue has left until it can be reused (at first this will be 0). If there 
is a reusable dead token it uses that. If it can't obtain a token from 
either queue it returns the request, thereby rescheduling it for later.
* In 'process_response': it determines whether the token has died, using 
'check_response'. If the token has died it is placed in the dead queue; 
if not, it is placed in the live queue.
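The queue handling above could be sketched like this (a simplified illustration under my own names; the pool class and its methods are mine, not the code in the linked branch). Dead tokens carry the time they died, so the pool knows when the rate-limit window has elapsed:

```python
import collections
import time


class TokenPool:
    """Illustrative sketch of the live/dead queue logic."""

    def __init__(self, tokens, window_secs):
        self.window_secs = window_secs
        self.live = collections.deque()
        # All tokens start in the dead queue with a death time of 0,
        # so they are immediately reusable on the first request.
        self.dead = collections.deque((tok, 0) for tok in tokens)

    def get_token(self, now=None):
        """Return a token, or None if none is usable yet
        (the caller should then reschedule the request)."""
        now = time.time() if now is None else now
        if self.live:
            return self.live.popleft()
        if self.dead and now - self.dead[0][1] >= self.window_secs:
            return self.dead.popleft()[0]
        return None

    def release(self, token, died, now=None):
        """Called from process_response once check_response has decided
        whether the token died on this response."""
        now = time.time() if now is None else now
        if died:
            self.dead.append((token, now))
        else:
            self.live.append(token)
```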

Again, any thoughts or suggestions are very welcome. Thanks,
Josh


On Friday, August 21, 2015 at 10:32:39 AM UTC+1, Josh Levy-Kramer wrote:
>
> I’ve merged aspects of my code with your OAuth class. I created a very 
> rough draft, so I ignored the case where a user defines their own 
> 'oauth_client' and only modified the OAuth1 middleware. The fork can be 
> seen here:
>
> https://github.com/joshlk/scrapy/blob/oauth-multi-token-draft/scrapy/contrib/downloadermiddleware/httpoauth.py
>
> Here are some notes:
>
> The problem: you have a pool of tokens but you are rate limited on a per 
> token basis. You therefore want to maximise the pool’s throughput by 
> cycling through the tokens.
>
> Method:
> * The user supplies a list of tokens via 'oauth_token_list'. The user also 
> specifies 'REQUEST_WINDOW_SIZE_MINS', which is how long one has to wait 
> until a token can be reused once it has reached its rate limit
> * Two lists are created, the dead queue and the live queue. All tokens are 
> at first put into the dead queue
> * In 'process_request' it: 1. Tries to obtain a token from the live queue 
> (which is empty at first). 2. Checks how long the first token in the dead 
> queue has left until it can be reused (at first this will be 0). If there 
> is a reusable dead token it uses that. If it can’t obtain a token from 
> either queue it places the request into a 'request_delayed' list, to be 
> processed later.
> * In 'process_response': it determines whether the token has died, using 
> 'check_response'. If the token has died it is placed in the dead queue; 
> if not, it is placed in the live queue.
> * The 'spider_ideal' method is called when the spider is idle. This can 
> occur when no tokens are currently available but requests exist in the 
> 'request_delayed' list. When this occurs the method sleeps until tokens 
> become available and then reschedules the requests
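As a rough illustration of that idle-time step (the function, names, and blocking sleep here are mine, not the code in the branch; the pool is assumed to expose a dead queue of (token, died_at) pairs):

```python
import time


def on_spider_idle(pool, delayed_requests, schedule, window_secs):
    """Sketch of the idle-time rescheduling step: when the spider goes
    idle with requests parked in the delayed list, wait until the
    oldest dead token's window has elapsed, then hand the requests
    back to the scheduler via the supplied `schedule` callable."""
    if not delayed_requests:
        return
    _, died_at = pool.dead[0]  # oldest dead token dies back to life first
    wait = max(0, died_at + window_secs - time.time())
    time.sleep(wait)  # crude: this blocks, which is part of the problem
    while delayed_requests:
        schedule(delayed_requests.pop())
```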
>
> Notes:
> * There are three places a token can be: the dead queue, live queue or 
> attached to a request
> * Determining whether a token has 'died' is, I believe, domain specific 
> (is there a standard way in the OAuth spec?). Therefore the 
> 'check_response' function can be defined by the user. The default is to 
> always assume the token died after one response. If 
> REQUEST_WINDOW_SIZE_MINS is also set to 0 (the default) then the token 
> pool will be continually recycled
> * 'check_response' also determines whether a request failed because of 
> the token. If it failed, the request is rescheduled.
> * Placing requests into a 'request_delayed' list and then invoking the 
> 'spider_ideal' method is quite convoluted. A much easier way would be to 
> simply delay the request for X mins until you know tokens will become 
> available. However, I couldn't determine a way to do this. Any ideas?
>
> All comments are very welcome. Thanks,
> Josh
>
>
>
>
> On Thursday, August 20, 2015 at 9:38:06 AM UTC+1, Juan Riaza wrote:
>>
>> It needs some proper tests plus docs. Hopefully I'll get time to do so 
>> early next month. I would like to check your code for cycling through a 
>> pool of tokens; it seems like a proper use case.
>>
>> On Wednesday, August 19, 2015 at 5:03:43 PM UTC+2, Josh Levy-Kramer wrote:
>>>
>>> Wow, this looks great. I wish I had found this a month ago! What 
>>> barriers are there to getting this into the main Scrapy code base?
>>>
>>> Additionally, I would say a common case with OAuth is that the user has 
>>> access to a pool of tokens and cycles through them, because APIs 
>>> usually restrict the number of calls that can be made per token. I have 
>>> written code that deals with this use case (although not very 
>>> gracefully). Would you be interested in incorporating this logic into 
>>> the OAuth middleware?
>>>
>>> On Wednesday, August 19, 2015 at 3:23:37 PM UTC+1, Juan Riaza wrote:
>>>>
>>>> Hi Josh,
>>>>
>>>> Nice to hear about that middleware. I worked some time ago on a draft 
>>>> implementation here: 
>>>> https://github.com/juanriaza/scrapy/commits/oauth-draft It would be 
>>>> awesome to check your middleware and give this a final push.
>>>>
>>>> On Wednesday, August 19, 2015 at 3:42:43 PM UTC+2, Josh Levy-Kramer 
>>>> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I have written an OAuth middleware module so Scrapy can access 
>>>>> websites or APIs which require OAuth authentication. Parts of the 
>>>>> module are quite specific to the sites I have been working with 
>>>>> (namely Twitter). Would anyone be interested in such a module? My 
>>>>> knowledge of the OAuth protocol is rather limited and I would be 
>>>>> interested in generalising the module for other websites that use the 
>>>>> OAuth protocol. Does anyone have the expertise to do this?
>>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.