[google-appengine] Re: Parallel urlfetch utility class / function.
Hey Joe,

With the gdata package you can do something like this instead. As usual, completely untested code, but it looks about right:

    import logging

    import megafetch
    from youtube import YouTubeVideoFeedFromString

    def get_feeds_async(usernames):
        fetcher = megafetch.Fetcher()
        output = {}

        def cb(username, result):
            # Check the result (not the output dict) for a failed fetch.
            if isinstance(result, Exception):
                logging.error('could not fetch: %s', result)
                content = None
            else:
                content = YouTubeVideoFeedFromString(result.content)
            output[username] = content

        for username in usernames:
            url = 'http://gdata.youtube.com/feeds/api/users/%s/uploads' % (username,)
            # Bind username as a default argument; a bare closure would
            # see only the last username once the loop has finished.
            fetcher.start(url, lambda result, username=username: cb(username, result))

        fetcher.wait()
        return output

    feeds = get_feeds_async([
        'davemw', 'waverlyflams', 'googletechtalks', 'TheOnion', 'winterelaxation'
    ])
    # feeds is now a mapping of usernames to YouTubeVideoFeed instances,
    # or None where a feed could not be fetched.
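One thing to watch when building callbacks with lambda inside a loop, as the code above does: a lambda looks up loop variables when it is called, not when it is created. The sketch below shows the difference in plain Python, with no App Engine or megafetch involved; the function names are made up for illustration.

```python
def make_callbacks_late(usernames):
    """Build one callback per username using a plain closure."""
    callbacks = []
    for username in usernames:
        # `username` is looked up when the lambda runs, so every
        # callback sees whatever value the loop finished on.
        callbacks.append(lambda result: (username, result))
    return callbacks


def make_callbacks_bound(usernames):
    """Build one callback per username, binding the name at creation time."""
    callbacks = []
    for username in usernames:
        # The default argument is evaluated immediately, so each
        # callback keeps its own username.
        callbacks.append(lambda result, username=username: (username, result))
    return callbacks


late = [cb('x')[0] for cb in make_callbacks_late(['a', 'b', 'c'])]
bound = [cb('x')[0] for cb in make_callbacks_bound(['a', 'b', 'c'])]
```

When the callbacks only fire after the loop has ended (for example inside a wait() call), the late-bound version would file every result under the last username.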
[google-appengine] Re: Parallel urlfetch utility class / function.
Ah ha.. thanks David.

And for the views, if I really wanted to launch everything at once, I could map my boss, youtube, twitter, etc etc pulls to their own urls, and use megafetch in my master view to pull those urls all at once too.
[google-appengine] Re: Parallel urlfetch utility class / function.
hey david, joe

I've got the async datastore Get working but I'm not sure the callbacks are being run on a background thread. they appear to be when you examine something like the thread local storage (hashes are all unique) but then if you insert just a simple time.sleep they appear to run serially.

(note - while not completely new to async code, this is my first run with python so I'm not sure of the threading contentions of something like sleep or logging.debug)

I would like to be able to run some code just after the fetch for each entity, the hope is that this would be run in parallel.

any thoughts?

cheers
brian
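A rough way to check the "do the callbacks really overlap?" question is to time a batch of sleeping callbacks: if they run serially the wall time is about the sum of the sleeps, and if they overlap it is about the longest one. The sketch below does this locally with plain Python threads; it says nothing about what App Engine itself does, and run_callbacks and slow_callback are made-up names.

```python
import threading
import time


def run_callbacks(callbacks, parallel):
    """Run every callback and return the elapsed wall-clock seconds."""
    started = time.perf_counter()
    if parallel:
        threads = [threading.Thread(target=cb) for cb in callbacks]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    else:
        for cb in callbacks:
            cb()
    return time.perf_counter() - started


def slow_callback():
    # Stand-in for "some code just after the fetch".
    time.sleep(0.2)


serial_elapsed = run_callbacks([slow_callback] * 4, parallel=False)
threaded_elapsed = run_callbacks([slow_callback] * 4, parallel=True)
```

If the same kind of measurement around the real callbacks comes out close to the serial figure, the callbacks are most likely being run one after another.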
[google-appengine] Re: Parallel urlfetch utility class / function.
This may be a really dumb question, but.. I'm still learning so...

Is there a way to do something other than a direct api call asynchronously? I'm writing a script that pulls from multiple sources, sometimes with higher level calls that use urlfetch, such as gdata. Since I'm attempting to pull from multiple sources, and sometimes multiple urls from each source, I'm trying to figure out if it's possible to run other methods at the same time.

For example, I want to pull a youtube entry for several different authors. The youtube api doesn't allow multiple authors in a request (I have an enhancement request in for that though), so I need to do a yt_service.GetYouTubeVideoFeed() for each author, then splice them together into one feed. As I'm also working with Boss, and eventually Twitter, I'll have feeds to pull from those sources as well.

My current application layout is using appengine-patch to provide django. I've set up a Boss and Youtube model with get methods that handle getting the data. So I can do something similar to:

    web_results = models.Boss.get(request.GET['term'], start=start)
    news_results = models.Boss.get(request.GET['term'], vertical=news, start=start)
    youtube = models.Youtube.get(request.GET['term'], start=start)

Ideally, I'd like some of those models to be able to do asynchronous tasks within their get function, and then also, I'd like to run the above requests at the same time, which should really speed the request up.

On Mar 17, 9:20 am, Joe Bowman bowman.jos...@gmail.com wrote:

Thanks, I'm going to give it a go for urlfetch calls for one project I'm working on this week. Not sure when I'd be able to include it in gaeutilities for cron and such, that project is currently lower on my priority list at the moment, but can't wait until I get a chance to play with it. Another idea I had for it is the ROTmodel (retry on timeout model) in the project, which could speed that process up.
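To picture what "run the above requests at the same time" would look like for the three lookups (web, news, YouTube), here is a minimal sketch that starts every source at once and then collects the results by name. It uses the standard library's thread pool purely as a stand-in; it is not how megafetch or the App Engine async proxy work, and fetch_sources and the fake source functions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_sources(term, start, sources):
    """Call every source function concurrently; return {name: result}."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # Submit everything first so the calls overlap...
        futures = dict(
            (name, pool.submit(fn, term, start)) for name, fn in sources.items()
        )
        # ...then block on each result in turn.
        return dict((name, future.result()) for name, future in futures.items())


# Hypothetical stand-ins for the model get() methods.
def fake_web(term, start):
    return 'web:%s:%d' % (term, start)


def fake_news(term, start):
    return 'news:%s:%d' % (term, start)


def fake_youtube(term, start):
    return 'youtube:%s:%d' % (term, start)


results = fetch_sources('appengine', 0, {
    'web': fake_web,
    'news': fake_news,
    'youtube': fake_youtube,
})
```

The total wait is then roughly that of the slowest source rather than the sum of all three.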
On Mar 17, 9:11 am, David Wilson d...@botanicus.net wrote:

2009/3/16 Joe Bowman bowman.jos...@gmail.com:

Wow that's great. The SDK might be problematic for you, as it appears to be very single threaded, I know for a fact it can't reply to requests to itself. Out of curiosity, are you still using base urlfetch, or is it your own creation?

While when Google releases their scheduled tasks functionality it will be less of an issue, if your solution had the ability to fire off urlfetch calls and not wait for a response, it could be a perfect fit for the gaeutilities cron utility. Currently it grabs a list of tasks it's supposed to run on request, sets a timestamp, runs one, then compares now() to the timestamp and if the timedelta is more than 1 second, stops running tasks and finishes the request.

It already appears your project would be perfect for running all necessary tasks at once, and the MIT License I believe is compatible with the BSD license I've released gaeutilities under, so would you have any personal objection to me including it in gaeutilities at some point, with proper attribution of course?

Sorry I missed this in the first reply - yeah work away! :)

David

If you haven't seen that project, its url is http://gaeutilities.appspot.com/

On Mar 16, 11:03 am, David Wilson d...@botanicus.net wrote:

Joe, I've only tested it in production. ;) The code should work serially on the SDK, but I haven't tried yet.

David.

2009/3/16 Joe Bowman bowman.jos...@gmail.com:

Does the batch fetching work on live appengine applications, or only on the SDK?
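The cron loop described earlier in this message (set a timestamp, run a task, compare now() to the timestamp, stop once more than a second has gone by) can be sketched roughly as below. This is an illustration of the description only, not the gaeutilities code; run_tasks and its parameters are made-up names, and the clock is injectable so the behaviour can be checked without real waiting.

```python
import time


def run_tasks(tasks, budget_seconds=1.0, clock=time.time):
    """Run tasks in order until the time budget is exceeded.

    Returns the number of tasks that were run.
    """
    started = clock()
    ran = 0
    for task in tasks:
        task()
        ran += 1
        # Check the budget after each task, as in the description.
        if clock() - started > budget_seconds:
            break
    return ran


# Fake clock: each call returns the next reading, 0.4 seconds apart.
ticks = iter([0.0, 0.4, 0.8, 1.2, 1.6, 2.0])
ran = run_tasks([lambda: None] * 5, budget_seconds=1.0, clock=lambda: next(ticks))
```

With serial tasks the budget caps how many can run per request; firing them all off in parallel would sidestep that cap, which is the fit Joe is pointing at.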
[google-appengine] Re: Parallel urlfetch utility class / function.
Very neat.. Thank you.

Just to clarify, can we use this for all API calls? Datastore too? I didn't look very closely at the async proxy in pubsubhubbub..

Asynchronous calls available on all apis might give a lot to chew on.. :) It's been a while since I've worked with async function calls or threading, might have to dig up some old notes to see where I could extract gains from it in my app. Some common cases might be worth the community documenting for all to benefit from, too.

On Mar 16, 1:26 pm, David Wilson d...@botanicus.net wrote:

I've created a Google Code project to contain some batch utilities I'm working on, based on async_apiproxy.py from pubsubhubbub[0]. The project currently contains just a modified async_apiproxy.py that doesn't require dummy google3 modules on the local machine, and a megafetch.py, for batch-fetching URLs.

http://code.google.com/p/appengine-async-tools/

David

[0] http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/async_a...

--
It is better to be wrong than to be vague.
  - Freeman Dyson

You received this message because you are subscribed to the Google Groups Google App Engine group.
To post to this group, send email to google-appengine@googlegroups.com
To unsubscribe from this group, send email to google-appengine+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en
[google-appengine] Re: Parallel urlfetch utility class / function.
It's completely undocumented (at this stage, anyway), but definitely seems to work. A few notes I've gathered:

- CPU time quota appears to be calculated based on literal time, rather than e.g. the UNIX concept of time spent in running state.

- I can fetch 100 URLs in 1.3 seconds from a machine colocated in Germany using the asynchronous API. I can't begin to imagine how slow (and therefore expensive in monetary terms) this would be using the standard API.

- The user-specified callback function appears to be invoked in a separate thread; the RPC isn't complete until this callback completes. The callback thread is still subject to the request deadline.

- It's a standard interface, and seems to have no parallel restrictions at least for urlfetch and Datastore. However, I imagine that it's possible restrictions may be placed here at some later stage, since you can burn a whole lot more AppEngine CPU more cheaply using the async api.

- It's standard only insomuch as you have to fiddle with AppEngine-internal protocolbuffer definitions for each service type. This mostly means copy-pasting the standard sync call code from the SDK, and hacking it to use pubsubhubbub's proxy code.

Per the last point, you might be better waiting for an officially sanctioned API for doing this, albeit I doubt the protocolbuffer definitions change all that often.

Thanks to Brett Slatkin & co. for doing the digging required to get the async stuff working! :)

David.
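As a back-of-the-envelope illustration of why the serial case is so much slower: in an idealised model, fetching one URL after another costs the sum of the individual latencies, while fetching them all at once costs about the slowest single one. The 0.3 second per-fetch latency below is an assumed figure for illustration, not a measurement from this thread, and the model ignores overhead and any concurrency limits.

```python
def serial_seconds(latencies):
    """Idealised time to fetch one URL after another."""
    return sum(latencies)


def parallel_seconds(latencies):
    """Idealised time when every fetch is in flight at once."""
    return max(latencies) if latencies else 0.0


# Assumption: 100 URLs, each taking 0.3 seconds to answer.
assumed = [0.3] * 100
serial_estimate = serial_seconds(assumed)      # about 30 seconds
parallel_estimate = parallel_seconds(assumed)  # about 0.3 seconds
```

If quota really is charged on wall-clock time, as the first note suggests, the same ratio would carry over to cost.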
[google-appengine] Re: Parallel urlfetch utility class / function.
A couple of questions re. CPU usage..

> CPU time quota appears to be calculated based on literal time

Can you clarify what you mean here? I presume each async request eats into your CPU budget. But you say:

> since you can burn a whole lot more AppEngine CPU more cheaply using the async api

Can you clarify how that's the case?

I would guess as long as you're being billed for the cpu-ms spent in your asynchronous calls, Google would let you hang yourself with them when it comes to billing.. :) so I presume they'd let you squeeze in as many as your original request, and its limit, will allow for?

Thanks again.
[google-appengine] Re: Parallel urlfetch utility class / function.
oh my, this is working now?!? I just assumed it would only be available from the next build. great work david!

I agree on waiting for the official release but its certainly something that we can test with right now in preparation for the new release.

thanks for digging this out (and thanks to Brett Slatkin as well)

cheers
brian
[google-appengine] Re: Parallel urlfetch utility class / function.
I have no idea how definitive this is, but "literal time" means that wall-clock time seems to be how CPU cost is measured. I guess this makes sense for a few different reasons.

I found an internal function, google3.apphosting.runtime._apphosting_runtime___python__apiproxy.get_request_cpu_usage, with the docstring: "Returns the number of megacycles used so far by this request. Does not include CPU used by API calls."

Calling it, then running time.sleep(5), then calling it again, indicates thousands of megacycles used, yet in real terms the CPU was probably doing nothing. I guess Datastore CPU, etc., is added on top of this, but it suggests to me that if you can drastically reduce request time, quota usage should drop too.

I have yet to do any kind of rough measurements of Datastore CPU, so I'm not sure how correct this all is.

David.

- One of the guys on IRC suggested this means that per-request cost is scaled during peak usage (when internal services are running slower).

2009/3/16 peterk peter.ke...@gmail.com:

A couple of questions re. CPU usage..

"CPU time quota appears to be calculated based on literal time"

Can you clarify what you mean here? I presume each async request eats into your CPU budget. But you say:

"since you can burn a whole lot more AppEngine CPU more cheaply using the async api"

Can you clarify how that's the case? I would guess as long as you're being billed for the cpu-ms spent in your asynchronous calls, Google would let you hang yourself with them when it comes to billing.. :) so I presume they'd let you squeeze in as many as your original request, and its limit, will allow for?

Thanks again.

On Mar 16, 2:00 pm, David Wilson d...@botanicus.net wrote:

It's completely undocumented (at this stage, anyway), but definitely seems to work. A few notes I've gathered:

- CPU time quota appears to be calculated based on literal time, rather than e.g. the UNIX concept of time spent in running state.
- I can fetch 100 URLs in 1.3 seconds from a machine colocated in Germany using the asynchronous API. I can't begin to imagine how slow (and therefore expensive in monetary terms) this would be using the standard API.
- The user-specified callback function appears to be invoked in a separate thread; the RPC isn't complete until this callback completes. The callback thread is still subject to the request deadline.
- It's a standard interface, and seems to have no parallel restrictions at least for urlfetch and Datastore. However, I imagine that it's possible restrictions may be placed here at some later stage, since you can burn a whole lot more AppEngine CPU more cheaply using the async api.
- It's standard only insomuch as you have to fiddle with AppEngine-internal protocolbuffer definitions for each service type. This mostly means copy-pasting the standard sync call code from the SDK, and hacking it to use pubsubhubbub's proxy code.

Per the last point, you might be better off waiting for an officially sanctioned API for doing this, although I doubt the protocolbuffer definitions change all that often.

Thanks to Brett Slatkin & co. for doing the digging required to get the async stuff working! :)

David.

2009/3/16 peterk peter.ke...@gmail.com:

Very neat.. Thank you. Just to clarify, can we use this for all API calls? Datastore too? I didn't look very closely at the async proxy in pubsubhubbub..

Asynchronous calls available on all apis might give a lot to chew on.. :) It's been a while since I've worked with async function calls or threading, might have to dig up some old notes to see where I could extract gains from it in my app. Some common cases might be worth the community documenting for all to benefit from, too.

On Mar 16, 1:26 pm, David Wilson d...@botanicus.net wrote:

I've created a Google Code project to contain some batch utilities I'm working on, based on async_apiproxy.py from pubsubhubbub[0]. The project currently contains just a modified async_apiproxy.py that doesn't require dummy google3 modules on the local machine, and a megafetch.py, for batch-fetching URLs.

http://code.google.com/p/appengine-async-tools/

David

[0] http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/async_a...

-- It is better to be wrong than to be vague. - Freeman Dyson

You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com
To unsubscribe from this group, send email to google-appengine+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en
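The gap between wall-clock time and "time spent in running state" that David describes with his time.sleep(5) experiment can be reproduced locally. This is only a sketch using the standard library's time.perf_counter and time.process_time; the helper name measure is made up for illustration, it does not call the internal get_request_cpu_usage function, and it says nothing definitive about how App Engine bills.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent while running fn()."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


# Sleeping costs wall-clock time but almost no CPU time.
wall, cpu = measure(lambda: time.sleep(0.5))
print("sleep: wall=%.2fs cpu=%.2fs" % (wall, cpu))
```

If a quota is charged on the first number rather than the second, a request that spends most of its life waiting on the network is charged for the waiting, which is why shortening the request would lower the charge.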
[google-appengine] Re: Parallel urlfetch utility class / function.
Does the batch fetching work on live appengine applications, or only on the SDK?
[google-appengine] Re: Parallel urlfetch utility class / function.
Joe, I've only tested it in production. ;) The code should work serially on the SDK, but I haven't tried yet.

David.
[google-appengine] Re: Parallel urlfetch utility class / function.
Wow, that's great. The SDK might be problematic for you, as it appears to be very single threaded; I know for a fact it can't reply to requests to itself.

Out of curiosity, are you still using base urlfetch, or is it your own creation? It will be less of an issue once Google releases their scheduled tasks functionality, but if your solution had the ability to fire off urlfetch calls and not wait for a response, it could be a perfect fit for the gaeutilities cron utility. Currently it grabs a list of tasks it's supposed to run on request, sets a timestamp, runs one, then compares now() to the timestamp, and if the timedelta is more than 1 second, stops running tasks and finishes the request.

It already appears your project would be perfect for running all necessary tasks at once, and I believe the MIT License is compatible with the BSD license I've released gaeutilities under, so would you have any personal objection to me including it in gaeutilities at some point, with proper attribution of course?

If you haven't seen that project, its URL is http://gaeutilities.appspot.com/
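The time-budget loop Joe describes for the gaeutilities cron utility (run a task, compare now() to the starting timestamp, stop once more than a second has passed) might look roughly like the sketch below. The name run_with_budget and the injectable now parameter are assumptions made for illustration; this is not the gaeutilities code.

```python
import datetime


def run_with_budget(tasks, budget=datetime.timedelta(seconds=1),
                    now=datetime.datetime.now):
    """Run tasks in order until the elapsed time exceeds the budget.

    Returns the tasks that were not run, so a later request can pick them up.
    """
    started = now()
    remaining = list(tasks)
    while remaining:
        task = remaining.pop(0)
        task()
        # The check happens after each task, so one slow task can overshoot.
        if now() - started > budget:
            break
    return remaining


# Demonstration with a fake clock: each task "takes" 0.6 seconds.
clock = [datetime.datetime(2009, 3, 16, 12, 0, 0)]

def slow_task():
    clock[0] += datetime.timedelta(seconds=0.6)

left = run_with_budget([slow_task, slow_task, slow_task], now=lambda: clock[0])
print(len(left))  # prints 1: the third task is left for the next request
```

Because the tasks run one after another, the budget caps how many fit in a request; issuing them in parallel is what would let all of them start inside the same window.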
[google-appengine] Re: Parallel urlfetch utility class / function.
@joe - fire/forget - you can just skip the fetcher.wait() call (which calls AsyncAPIProxy.wait). I'm not sure if you would need a valid callback, but even if you did, it could be a simple stub that does nothing.

@david - have you made this work with datastore calls yet? having some issues trying to figure out how to set pbrequest/pbresponse variables

cheers
brian
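The start()/wait()/callback shape being discussed can be illustrated without App Engine at all. The class below is a toy stand-in built on threads: ToyFetcher and ignore are hypothetical names, jobs are plain callables rather than URLs, and none of this is how megafetch or AsyncAPIProxy is actually implemented. It only shows a result (or an exception) being handed to a callback, and a do-nothing stub callback for callers who do not care about the result.

```python
import threading


class ToyFetcher(object):
    """Toy stand-in with a start()/wait() shape; NOT megafetch itself."""

    def __init__(self):
        self._threads = []

    def start(self, job, callback):
        # Run job() in a thread and pass its result, or the exception
        # it raised, to callback.
        def runner():
            try:
                result = job()
            except Exception as exc:
                result = exc
            callback(result)

        thread = threading.Thread(target=runner)
        thread.start()
        self._threads.append(thread)

    def wait(self):
        # Block until every started job has invoked its callback.
        for thread in self._threads:
            thread.join()


def ignore(result):
    """Stub callback for callers that do not need the result."""
    pass


results = []
fetcher = ToyFetcher()
fetcher.start(lambda: "body of page A", results.append)
fetcher.start(lambda: 1 / 0, results.append)   # failure arrives as an exception
fetcher.start(lambda: "nobody reads this", ignore)
fetcher.wait()
```

In this toy, skipping wait() only means the caller stops blocking; the started jobs still run to completion in the background.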
[google-appengine] Re: Parallel urlfetch utility class / function.
I forgot to mention, AppEngine does not close the request until all asynchronous requests have ended. This means it's not truly fire and forget. Regardless of whether you're waiting for a response or not, if a request is in progress, the HTTP response body is not returned to the client.

I created a simple function this morning to call datastore_v3.Delete on a set of key objects; it appeared to work, but I didn't test beyond ensuring the callback didn't receive an exception. Pretty untested code here: http://pastie.org/417496.

For simple uses, it's probably not all that useful to call Datastore asynchronously anyway, since unlike urlfetch, you can already minimize latency by making batch calls at the start/end of your request for all the keys you want to load/save.

It's possibly useful to use it to concurrently commit a bunch of different transactions, but the code for this is less trivial than the urlfetch case. Probably best to see what the AppEngine team themselves provide for this. ;)

David.
[google-appengine] Re: Parallel urlfetch utility class / function.
I imagine keeping the request open until everything is done isn't going to go away any time soon; it's how HTTP responses work, and the scheduled tasks on the roadmap would be better suited to supporting that kind of work.

I also agree that the batch put and get functionality is there for the most part. My experience from mass delete scripts has been that delete is extremely heavy, and before the runtime length was extended, I came up with 75 as the safe number of entities to delete in a request without encountering timeouts for the most part. I ended up using javascript with a simple protocol (responses of "there's more" and "all done") in order to delete 10k+ objects at a time.

During that time I did notice that repeated writing to the datastore (or deleting, in my case) also caused other errors, which looked like I was being throttled, so that's something else you may encounter if you continue to work on asynchronous datastore calls.
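The chunked-delete protocol Joe mentions (delete a bounded batch per request, answer "there's more" or "all done", and let the client keep calling) can be sketched as follows. An in-memory set stands in for the datastore and a Python loop stands in for the JavaScript client; delete_batch and drive are hypothetical names, and 75 is simply the figure quoted in the message.

```python
BATCH_SIZE = 75  # the per-request figure quoted in the message


def delete_batch(store, batch_size=BATCH_SIZE):
    """Delete up to batch_size items from store and report whether more remain."""
    doomed = list(store)[:batch_size]
    for key in doomed:
        store.discard(key)
    return "there's more" if store else "all done"


def drive(store):
    """Play the client's role: keep requesting batches until told to stop."""
    requests = 0
    while True:
        requests += 1
        if delete_batch(store) == "all done":
            return requests


store = set(range(200))
print(drive(store))  # prints 3: batches of 75, 75 and 50
```

Keeping each request small is what avoids the timeout; the client-side loop just spreads the total work over as many requests as it takes.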
[google-appengine] Re: Parallel urlfetch utility class / function.
thanks david. agreed on datastore except that unlike the current batch calls, you might be able to execute code concurrently on each response and then wait for all the worker's results. to me, and I could be wrong, even a no-op datastore request could serve as a poor man's worker thread. I'll see if I can get it working on our stuff and report back (did you happen to notice if all the threads were started on the same machine?) regardless, it will just be testing for right now. I'm sure the GAE team has their own ideas about whats allowed with async access. cheers and thanks again brian On Mar 16, 1:12 pm, David Wilson d...@botanicus.net wrote: I forgot to mention, AppEngine does not close the request until all asynchronous requests have ended. This means it's not truly fire and forget. Regardless of whether you're waiting for a response or not, if a request is in progress, the HTTP response body is not returned to the client. I created a simple function this morning to call datastore_v3.Delete on a set of key objects, it appeared to work but I didn't test beyond ensuring the callback didn't receive an exception. Pretty untested code here: http://pastie.org/417496. For simple uses, it's probably not all that useful to call Datastore asynchronously is all that useful anyway, since unlike urlfetch, you can already minimize latency by making batch calls at the start/end of your request for all the keys you want to load/save. It's possibly useful to use it to concurrently commit a bunch of different transactions, but the code for this is less trivial than the urlfetch case. Probably best to see what the AppEngine team themselves provide for this. ;) David. 2009/3/16 bFlood bflood...@gmail.com: @joe - fire/forget - you can just skip the fetcher.wait() call (which call AsyncAPIProxy.wait). I'm not sure of you would need a valid callback but even if you did it could be a simple stub that does nothing. @david - have you made this work with datastore calls yet? 
having some issues trying to figure out how to set pbrequest/pbresponse variables

cheers
brian

On Mar 16, 12:05 pm, Joe Bowman bowman.jos...@gmail.com wrote:

Wow that's great. The SDK might be problematic for you, as it appears to be very single threaded; I know for a fact it can't reply to requests to itself.

Out of curiosity, are you still using base urlfetch, or is it your own creation? While it will be less of an issue once Google releases their scheduled tasks functionality, if your solution had the ability to fire off urlfetch calls and not wait for a response, it could be a perfect fit for the gaeutilities cron utility. Currently it grabs a list of tasks it's supposed to run on request, sets a timestamp, runs one, then compares now() to the timestamp and, if the timedelta is more than 1 second, stops running tasks and finishes the request.

It already appears your project would be perfect for running all necessary tasks at once, and the MIT License, I believe, is compatible with the BSD license I've released gaeutilities under, so would you have any personal objection to me including it in gaeutilities at some point, with proper attribution of course?

If you haven't seen that project, its URL is http://gaeutilities.appspot.com/

On Mar 16, 11:03 am, David Wilson d...@botanicus.net wrote:

Joe, I've only tested it in production. ;) The code should work serially on the SDK, but I haven't tried yet.

David.

2009/3/16 Joe Bowman bowman.jos...@gmail.com:

Does the batch fetching work on live appengine applications, or only on the SDK?

On Mar 16, 10:19 am, David Wilson d...@botanicus.net wrote:

I have no idea how definitive this is, but literally it means wall clock time seems to be how CPU cost is measured. I guess this makes sense for a few different reasons.

I found some internal function google3.apphosting.runtime._apphosting_runtime___python__apiproxy.get_request_cpu_usage with the docstring:

Returns the number of megacycles used so far by this request.
Does not include CPU used by API calls.

Calling it, then running time.sleep(5), then calling it again, indicates thousands of megacycles used, yet in real terms the CPU was probably doing nothing. I guess Datastore CPU, etc., is added on top of this, but it seems to suggest to me that if you can drastically reduce request time, quota usage should drop too.

I have yet to do any kind of rough measurements of Datastore CPU, so I'm not sure how correct this all is.

David.

- One of the guys on IRC suggested this means that per-request cost is scaled during peak usage (and thus internal services running slower).

2009/3/16 peterk peter.ke...@gmail.com:

A couple of questions re. CPU usage..

CPU time quota appears to be calculated based on literal time

Can you clarify what