Ah ha.. thanks David.

And for the views, if I really wanted to launch everything at once, I
could map my Boss, YouTube, Twitter, etc. pulls to their own urls,
and use megafetch in my master view to pull those urls all at once
too.
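The fan-out idea above can be sketched with plain threads. `MiniFetcher` below is a toy stand-in for a parallel fetcher like megafetch's — the names and behaviour are guesses for illustration, not the real megafetch API, and the fake `fetch` function keeps the sketch network-free:

```python
import threading

class MiniFetcher:
    """Toy stand-in for a parallel fetcher (illustrative, not the
    real megafetch.Fetcher API)."""

    def __init__(self, fetch):
        # `fetch` is injected so the sketch needs no network; a real
        # fetcher would call urlfetch/urllib and pass back the response.
        self.fetch = fetch
        self.threads = []

    def start(self, url, callback):
        # Run one fetch in its own thread and hand the result (or the
        # exception, if the fetch failed) to the caller's callback.
        def run():
            try:
                result = self.fetch(url)
            except Exception as e:
                result = e
            callback(result)
        t = threading.Thread(target=run)
        t.start()
        self.threads.append(t)

    def wait(self):
        # Block until every started fetch has called back.
        for t in self.threads:
            t.join()

# Fan out over per-source urls, collecting results into one dict.
output = {}
fetcher = MiniFetcher(fetch=lambda url: 'content of %s' % url)
for name in ('boss', 'youtube', 'twitter'):
    # Bind `name` as a default argument so each callback keeps its own
    # source name instead of the loop's final value.
    fetcher.start('/pull/%s' % name,
                  lambda result, name=name: output.update({name: result}))
fetcher.wait()
# output now maps each source name to its fetched content
```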

On Mar 18, 5:14 am, David Wilson <d...@botanicus.net> wrote:
> Hey Joe,
>
> With the gdata package you can do something like this instead:
>
> As usual, completely untested code, but looks about right..
>
> import logging
> import megafetch
> from gdata.youtube import YouTubeVideoFeedFromString
>
> def get_feeds_async(usernames):
>     fetcher = megafetch.Fetcher()
>     output = {}
>
>     def cb(username, result):
>         if isinstance(result, Exception):
>             logging.error('could not fetch %s: %s', username, result)
>             content = None
>         else:
>             content = YouTubeVideoFeedFromString(result.content)
>         output[username] = content
>
>     for username in usernames:
>         url = 'http://gdata.youtube.com/feeds/api/users/%s/uploads' % \
>             (username,)
>         # Bind username as a default argument, otherwise every callback
>         # would see the loop variable's final value.
>         fetcher.start(url, lambda result, username=username: cb(username, result))
>
>     fetcher.wait()
>     return output
>
> feeds = get_feeds_async([ 'davemw', 'waverlyflams', 'googletechtalks',
>                           'TheOnion', 'winterelaxation' ])
>
> # feeds is now a mapping of usernames to YouTubeVideoFeed instances,
> # or None if the feed could not be fetched.
>
> 2009/3/18 Joe Bowman <bowman.jos...@gmail.com>:
>
> > This may be a really dumb question, but.. I'm still learning so...
>
> > Is there a way to do something other than a direct api call
> > asynchronously? I'm writing a script that pulls from multiple sources,
> > sometimes with higher level calls that use urlfetch, such as gdata.
> > Since I'm attempting to pull from multiple sources, and sometimes
> > multiple urls from each source, I'm trying to figure out if it's
> > possible to run other methods at the same time.
>
> > For example, I want to pull a youtube entry for several different
> > authors. The youtube api doesn't allow multiple authors in a request
> > (I have an enhancement request in for that though), so I need to do a
> > yt_service.GetYouTubeVideoFeed() for each author, then splice them
> > together into one feed. As I'm also working with Boss, and eventually
> > Twitter, I'll have feeds to pull from those sources as well.
>
> > My current application layout is using appengine-patch to provide
> > django. I've set up a Boss and Youtube "model" with get methods that
> > handle getting the data. So I can do something similar to:
>
> > web_results = models.Boss.get(request.GET['term'], start=start)
> > news_results = models.Boss.get(request.GET['term'], vertical="news",
> > start=start)
> > youtube = models.Youtube.get(request.GET['term'], start=start)
>
> > Ideally, I'd like some of those models to be able to do asynchronous
> > tasks within their get function, and then also, I'd like to run the
> > above requests at the same time, which should really speed the request up.
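Running those three `get` calls concurrently can be sketched as follows. This is a plain-thread illustration only — App Engine did not allow user threads at the time, so real code would have to go through the async API proxy instead — and the stand-in callables are hypothetical, not the real `models.Boss.get` / `models.Youtube.get`:

```python
import threading

def run_parallel(named_calls):
    """Run a dict of zero-arg callables concurrently and collect their
    results by name. Thread sketch only; not valid inside App Engine
    request handlers, where user threads are not permitted."""
    results = {}

    def invoke(name, call):
        results[name] = call()

    threads = [threading.Thread(target=invoke, args=(name, call))
               for name, call in named_calls.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Hypothetical stand-ins for models.Boss.get / models.Youtube.get:
results = run_parallel({
    'web': lambda: 'boss web results',
    'news': lambda: 'boss news results',
    'youtube': lambda: 'youtube results',
})
# Total wall time is roughly the slowest call, not the sum of all three.
```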
>
> > On Mar 17, 9:20 am, Joe Bowman <bowman.jos...@gmail.com> wrote:
> >> Thanks,
>
> >> I'm going to give it a go for urlfetch calls for one project I'm
> >> working on this week.
>
> >> Not sure when I'd be able to include it in gaeutilities for cron and
> >> such, that project is currently lower on my priority list at the
> >> moment, but can't wait until I get a chance to play with it. Another
> >> idea I had for it is the ROTmodel (retry on timeout model) in the
> >> project, which could speed that process up.
>
> >> On Mar 17, 9:11 am, David Wilson <d...@botanicus.net> wrote:
>
> >> > 2009/3/16 Joe Bowman <bowman.jos...@gmail.com>:
>
> >> > > Wow that's great. The SDK might be problematic for you, as it appears
> >> > > to be very single-threaded; I know for a fact it can't reply to
> >> > > requests to itself.
>
> >> > > Out of curiosity, are you still using base urlfetch, or is it your own
> >> > > creation? While when Google releases their scheduled tasks
> >> > > functionality it will be less of an issue, if your solution had the
> >> > > ability to fire off urlfetch calls and not wait for a response, it
> >> > > could be a perfect fit for the gaeutilities cron utility.
>
> >> > > Currently it grabs a list of tasks it's supposed to run on request,
> >> > > sets a timestamp, runs one, then compares now() to the timestamp and if
> >> > > the timedelta is more than 1 second, stops running tasks and finishes
> >> > > the request. It already appears your project would be perfect for
> >> > > running all necessary tasks at once, and the MIT License I believe is
> >> > > compatible with the BSD license I've released gaeutilities, so would
> >> > > you have any personal objection to me including it in gaeutilities at
> >> > > some point, with proper attribution of course?
>
> >> > Sorry I missed this in the first reply - yeah work away! :)
>
> >> > David
>
> >> > > If you haven't seen that project, its url is
> >> > > http://gaeutilities.appspot.com/
>
> >> > > On Mar 16, 11:03 am, David Wilson <d...@botanicus.net> wrote:
> >> > >> Joe,
>
> >> > >> I've only tested it in production. ;)
>
> >> > >> The code should work serially on the SDK, but I haven't tried yet.
>
> >> > >> David.
>
> >> > >> 2009/3/16 Joe Bowman <bowman.jos...@gmail.com>:
>
> >> > >> > Does the batch fetching work on live appengine applications, or
> >> > >> > only on the SDK?
>
> >> > >> > On Mar 16, 10:19 am, David Wilson <d...@botanicus.net> wrote:
> >> > >> >> I have no idea how definitive this is, but it means wall-clock
> >> > >> >> time seems to be how CPU cost is measured. I guess this makes
> >> > >> >> sense for a few different reasons.
>
> >> > >> >> I found some internal function
> >> > >> >> "google3.apphosting.runtime._apphosting_runtime___python__apiproxy.get_request_cpu_usage"
> >> > >> >> with the docstring:
>
> >> > >> >>     Returns the number of megacycles used so far by this request.
> >> > >> >>     Does not include CPU used by API calls.
>
> >> > >> >> Calling it, then running time.sleep(5), then calling it again,
> >> > >> >> indicates thousands of megacycles used, yet in real terms the CPU
> >> > >> >> was probably doing nothing. I guess Datastore CPU, etc., is added
> >> > >> >> on top of this, but it seems to suggest to me that if you can
> >> > >> >> drastically reduce request time, quota usage should drop too.
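If quota really does track wall-clock time, the saving from going parallel is easy to demonstrate with simulated fetches: five 100 ms "fetches" cost roughly their sum serially, but only roughly the slowest one in parallel. A network-free sketch (time.sleep stands in for urlfetch round trips; the numbers are illustrative, not measured App Engine costs):

```python
import threading
import time

def timed(fn):
    # Measure wall-clock time, which is what the quota seems to bill.
    start = time.time()
    fn()
    return time.time() - start

def fake_fetch():
    time.sleep(0.1)  # stand-in for one ~100 ms urlfetch round trip

def serial():
    # Five fetches back to back: wall time is the sum (~0.5 s).
    for _ in range(5):
        fake_fetch()

def parallel():
    # Five fetches at once: wall time is the slowest fetch (~0.1 s).
    threads = [threading.Thread(target=fake_fetch) for _ in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

serial_s = timed(serial)
parallel_s = timed(parallel)
```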
>
> >> > >> >> I have yet to do any kind of rough measurement of Datastore CPU,
> >> > >> >> so I'm not sure how correct this all is.
>
> >> > >> >> David.
>
> >> > >> >>  - One of the guys on IRC suggested this means that per-request
> >> > >> >> cost is scaled during peak usage (and thus internal services
> >> > >> >> running slower).
>
> >> > >> >> 2009/3/16 peterk <peter.ke...@gmail.com>:
>
> >> > >> >> > A couple of questions re. CPU usage..
>
> >> > >> >> > "CPU time quota appears to be calculated based on literal time"
>
> >> > >> >> > Can you clarify what you mean here? I presume each async
> >> > >> >> > request eats into your CPU budget. But you say:
>
> >> > >> >> > "since you can burn a whole lot more AppEngine CPU more
> >> > >> >> > cheaply using the async api"
>
> >> > >> >> > Can you clarify how that's the case?
>
> >> > >> >> > I would guess as long as you're being billed for the cpu-ms
> >> > >> >> > spent in your asynchronous calls, Google would let you hang
> >> > >> >> > yourself with them when it comes to billing.. :) so I presume
> >> > >> >> > they'd let you squeeze in as many as your original request, and
> >> > >> >> > its limit, will allow for?
>
> >> > >> >> > Thanks again.
>
> >> > >> >> > On Mar 16, 2:00 pm, David Wilson <d...@botanicus.net> wrote:
> >> > >> >> >> It's completely undocumented (at this stage, anyway), but
> >> > >> >> >> definitely seems to work. A few notes I've gathered:
>
> >> > >> >> >>  - CPU time quota appears to be calculated based on literal
> >> > >> >> >> time, rather than e.g. the UNIX concept of "time spent in
> >> > >> >> >> running state".
>
> >> > >> >> >>  - I can fetch 100 URLs in 1.3 seconds from a machine
> >> > >> >> >> colocated in Germany using the asynchronous API. I can't
> >> > >> >> >> begin to imagine how slow (and therefore expensive in
> >> > >> >> >> monetary terms) this would be using the standard API.
>
> >> > >> >> >>  - The user-specified callback function appears to be
> >> > >> >> >> invoked in a separate thread; the RPC isn't "complete" until
> >> > >> >> >> this callback completes. The callback thread is still subject
> >> > >> >> >> to the request deadline.
>
> >> > >> >> >>  - It's a standard interface, and seems to have no parallel
> >> > >> >> >> restrictions, at least for urlfetch and Datastore. However, I
> >> > >> >> >> imagine that it's possible restrictions may be placed here at
> >> > >> >> >> some later stage, since you can burn a whole lot more
> >> > >> >> >> AppEngine CPU more cheaply using the async api.
>
> >> > >> >> >>  - It's "standard" only insomuch as you have to fiddle with
> >> > >> >> >> AppEngine-internal protocolbuffer definitions for each
> >> > >> >> >> service type. This mostly means copy-pasting the standard
> >> > >> >> >> sync call code from the SDK, and hacking it to use
> >> > >> >> >> pubsubhubbub's proxy code.
>
> >> > >> >> >> Per the last point, you might be better off waiting for an
> >> > >> >> >> officially sanctioned API for doing this, albeit I doubt the
> >> > >> >> >> protocolbuffer definitions change all that often.
>
> >> > >> >> >> Thanks to Brett Slatkin & co. for doing the digging required
> >> > >> >> >> to get the async stuff working! :)
>
> >> > >> >> >> David.
>
> >> > >> >> >> 2009/3/16 peterk <peter.ke...@gmail.com>:
>
> >> > >> >> >> > Very neat.. Thank you.
>
> >> > >> >> >> > Just to clarify, can we use this for all API calls?
> >> > >> >> >> > Datastore too? I didn't look very closely at the async
> >> > >> >> >> > proxy in pubsubhubbub..
>
> >> > >> >> >> > Asynchronous calls available on all apis might give a lot
> >> > >> >> >> > to chew on.. :) It's been a while since I've worked with
> >> > >> >> >> > async function calls or threading; might have to dig up
> >> > >> >> >> > some old notes to see where I could extract gains from it
> >> > >> >> >> > in my app. Some common cases might be worth the community
> >> > >> >> >> > documenting for all to benefit from, too.
>
> >> > >> >> >> > On Mar 16, 1:26 pm, David Wilson <d...@botanicus.net> wrote:
>
> ...
>
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en
-~----------~----~----~----~------~----~------~--~---
