Hey Joe,

With the gdata package you can do something like this instead:


As usual, completely untested code, but looks about right..


import logging

# megafetch ships with appengine-async-tools
# (http://code.google.com/p/appengine-async-tools/).
import megafetch
from gdata.youtube import YouTubeVideoFeedFromString


def get_feeds_async(usernames):
    fetcher = megafetch.Fetcher()
    output = {}

    def cb(username, result):
        # On failure megafetch hands the callback an Exception
        # instance rather than a response object.
        if isinstance(result, Exception):
            logging.error('could not fetch %s: %s', username, result)
            content = None
        else:
            content = YouTubeVideoFeedFromString(result.content)
        output[username] = content

    for username in usernames:
        url = 'http://gdata.youtube.com/feeds/api/users/%s/uploads' %\
            (username,)
        # Bind username via a default argument; a plain closure would
        # see only the loop variable's final value.
        fetcher.start(url, lambda result, username=username: cb(username, result))

    fetcher.wait()
    return output


feeds = get_feeds_async([ 'davemw', 'waverlyflams', 'googletechtalks',
                          'TheOnion', 'winterelaxation' ])

# feeds is now a mapping of usernames to YouTubeVideoFeed instances,
# or None where the feed could not be fetched.
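
Since you also want to splice the per-author results into one feed,
here's a rough sketch of that too (equally untested; it assumes YouTube
returns entry published timestamps in one consistent timezone, so plain
string comparison of the ISO 8601 values sorts chronologically):


def merge_feeds(feeds):
    # Gather entries from every feed that fetched successfully.
    entries = []
    for feed in feeds.itervalues():
        if feed is not None:
            entries.extend(feed.entry)
    # Newest first.
    entries.sort(key=lambda e: e.published.text, reverse=True)
    return entries


entries = merge_feeds(feeds)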


2009/3/18 Joe Bowman <bowman.jos...@gmail.com>:
>
> This may be a really dumb question, but.. I'm still learning so...
>
> Is there a way to do something other than a direct api call
> asynchronously? I'm writing a script that pulls from multiple sources,
> sometimes with higher level calls that use urlfetch, such as gdata.
> Since I'm attempting to pull from multiple sources, and sometimes
> multiple urls from each source, I'm trying to figure out if it's
> possible to run other methods at the same time.
>
> For example, I want to pull a youtube entry for several different
> authors. The youtube api doesn't allow multiple authors in a request
> (I have a enhancement request in for that though), so I need to do a
> yt_service.GetYouTubeVideoFeed() for each author, then splice them
> together into one feed. As I'm also working with Boss, and eventually
> Twitter, I'll have feeds to pull from those sources as well.
>
> My current application layout is using appengine-patch to provide
> django. I've set up a Boss and Youtube "model" with get methods that
> handle getting the data. So I can do something similar to:
>
> web_results = models.Boss.get(request.GET['term'], start=start)
> news_results = models.Boss.get(request.GET['term'], vertical="news",
> start=start)
> youtube = models.Youtube.get(request.GET['term'], start=start)
>
> Ideally, I'd like some of those models to be able to do asynchronous
> tasks within their get function, and then also, I'd like to run the
> above requests at the same time, which should really speed the request up.
>
>
> On Mar 17, 9:20 am, Joe Bowman <bowman.jos...@gmail.com> wrote:
>> Thanks,
>>
>> I'm going to give it a go for urlfetch calls for one project I'm
>> working on this week.
>>
>> Not sure when I'd be able to include it in gaeutilities for cron and
>> such, that project is currently lower on my priority list at the
>> moment, but can't wait until I get a chance to play with it. Another
>> idea I had for it is the ROTmodel (retry on timeout model) in the
>> project, which could speed that process up.
>>
>> On Mar 17, 9:11 am, David Wilson <d...@botanicus.net> wrote:
>>
>> > 2009/3/16 Joe Bowman <bowman.jos...@gmail.com>:
>>
>> > > Wow that's great. The SDK might be problematic for you, as it appears
>> > > to be very single threaded, I know for a fact it can't reply to
>> > > requests to itself.
>>
>> > > Out of curiosity, are you still using base urlfetch, or is it your own
>> > > creation? While when Google releases their scheduled tasks
>> > > functionality it will be less of an issue, if your solution had the
>> > > ability to fire off urlfetch calls and not wait for a response, it
>> > > could be a perfect fit for the gaeutilities cron utility.
>>
>> > > Currently it grabs a list of tasks it's supposed to run on request,
>> > > sets a timestamp, runs one, then compares now() to the timestamp and if
>> > > the timedelta is more than 1 second, stops running tasks and finishes
>> > > the request. It already appears your project would be perfect for
>> > > running all necessary tasks at once, and the MIT License I believe is
>> > > compatible with the BSD license I've released gaeutilities, so would
>> > > you have any personal objection to me including it in gaeutilities at
>> > > some point, with proper attribution of course?
>>
>> > Sorry I missed this in the first reply - yeah work away! :)
>>
>> > David
>>
>> > > If you haven't seen that project, its url is
>> > > http://gaeutilities.appspot.com/
>>
>> > > On Mar 16, 11:03 am, David Wilson <d...@botanicus.net> wrote:
>> > >> Joe,
>>
>> > >> I've only tested it in production. ;)
>>
>> > >> The code should work serially on the SDK, but I haven't tried yet.
>>
>> > >> David.
>>
>> > >> 2009/3/16 Joe Bowman <bowman.jos...@gmail.com>:
>>
>> > >> > Does the batch fetching work on live appengine applications, or
>> > >> > only on the SDK?
>>
>> > >> > On Mar 16, 10:19 am, David Wilson <d...@botanicus.net> wrote:
>> > >> >> I have no idea how definitive this is, but it literally means wall
>> > >> >> clock time seems to be how CPU cost is measured. I guess this makes
>> > >> >> sense for a few different reasons.
>>
>> > >> >> I found some internal function
>> > >> >> "google3.apphosting.runtime._apphosting_runtime___python__apiproxy.get_requ
>> > >> >>  est_cpu_usage"
>> > >> >> with the docstring:
>>
>> > >> >>     Returns the number of megacycles used so far by this request.
>> > >> >>     Does not include CPU used by API calls.
>>
>> > >> >> Calling it, then running time.sleep(5), then calling it again,
>> > >> >> indicates thousands of megacycles used, yet in real terms the CPU was
>> > >> >> probably doing nothing. I guess Datastore CPU, etc., is added on top
>> > >> >> of this, but it seems to suggest to me that if you can drastically
>> > >> >> reduce request time, quota usage should drop too.
>>
>> > >> >> I have yet to do any kind of rough measurements of Datastore CPU, so
>> > >> >> I'm not sure how correct this all is.
>>
>> > >> >> David.
>>
>> > >> >>  - One of the guys on IRC suggested this means that per-request cost
>> > >> >> is scaled up during peak usage (when internal services run slower).
>>
>> > >> >> 2009/3/16 peterk <peter.ke...@gmail.com>:
>>
>> > >> >> > A couple of questions re. CPU usage..
>>
>> > >> >> > "CPU time quota appears to be calculated based on literal time"
>>
>> > >> >> > Can you clarify what you mean here? I presume each async request
>> > >> >> > eats into your CPU budget. But you say:
>>
>> > >> >> > "since you can burn a whole lot more AppEngine CPU more cheaply 
>> > >> >> > using
>> > >> >> > the async api"
>>
>> > >> >> > Can you clarify how that's the case?
>>
>> > >> >> > I would guess as long as you're being billed for the cpu-ms spent in
>> > >> >> > your asynchronous calls, Google would let you hang yourself with them
>> > >> >> > when it comes to billing.. :) so I presume they'd let you squeeze in
>> > >> >> > as many as your original request, and its limit, will allow for?
>>
>> > >> >> > Thanks again.
>>
>> > >> >> > On Mar 16, 2:00 pm, David Wilson <d...@botanicus.net> wrote:
>> > >> >> >> It's completely undocumented (at this stage, anyway), but definitely
>> > >> >> >> seems to work. A few notes I've gathered:
>>
>> > >> >> >>  - CPU time quota appears to be calculated based on literal time,
>> > >> >> >> rather than e.g. the UNIX concept of "time spent in running state".
>>
>> > >> >> >>  - I can fetch 100 URLs in 1.3 seconds from a machine colocated in
>> > >> >> >> Germany using the asynchronous API. I can't begin to imagine how slow
>> > >> >> >> (and therefore expensive in monetary terms) this would be using the
>> > >> >> >> standard API.
>>
>> > >> >> >>  - The user-specified callback function appears to be invoked in a
>> > >> >> >> separate thread; the RPC isn't "complete" until this callback
>> > >> >> >> completes. The callback thread is still subject to the request
>> > >> >> >> deadline.
>>
>> > >> >> >>  - It's a standard interface, and seems to have no parallel
>> > >> >> >> restrictions at least for urlfetch and Datastore. However, I imagine
>> > >> >> >> that it's possible restrictions may be placed here at some later
>> > >> >> >> stage, since you can burn a whole lot more AppEngine CPU more cheaply
>> > >> >> >> using the async api.
>>
>> > >> >> >>  - It's "standard" only insomuch as you have to fiddle with
>> > >> >> >> AppEngine-internal protocolbuffer definitions for each service 
>> > >> >> >> type.
>> > >> >> >> This mostly means copy-pasting the standard sync call code from 
>> > >> >> >> the
>> > >> >> >> SDK, and hacking it to use pubsubhubub's proxy code.
>>
>> > >> >> >> Per the last point, you might be better waiting for an officially
>> > >> >> >> sanctioned API for doing this, albeit I doubt the protocolbuffer
>> > >> >> >> definitions change all that often.
>>
>> > >> >> >> Thanks to Brett Slatkin & co. for doing the digging required to get
>> > >> >> >> the async stuff working! :)
>>
>> > >> >> >> David.
>>
>> > >> >> >> 2009/3/16 peterk <peter.ke...@gmail.com>:
>>
>> > >> >> >> > Very neat.. Thank you.
>>
>> > >> >> >> > Just to clarify, can we use this for all API calls? Datastore too?
>> > >> >> >> > I didn't look very closely at the async proxy in pubsubhubbub..
>>
>> > >> >> >> > Asynchronous calls available on all apis might give a lot to chew
>> > >> >> >> > on.. :) It's been a while since I've worked with async function calls
>> > >> >> >> > or threading, might have to dig up some old notes to see where I could
>> > >> >> >> > extract gains from it in my app. Some common cases might be worth the
>> > >> >> >> > community documenting for all to benefit from, too.
>>
>> > >> >> >> > On Mar 16, 1:26 pm, David Wilson <d...@botanicus.net> wrote:
>> > >> >> >> >> I've created a Google Code project to contain some batch utilities
>> > >> >> >> >> I'm working on, based on async_apiproxy.py from pubsubhubbub[0]. The
>> > >> >> >> >> project currently contains just a modified async_apiproxy.py that
>> > >> >> >> >> doesn't require dummy google3 modules on the local machine, and a
>> > >> >> >> >> megafetch.py, for batch-fetching URLs.
>>
>> > >> >> >> >>    http://code.google.com/p/appengine-async-tools/
>>
>> > >> >> >> >> David
>>
>> > >> >> >> >> [0]http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/async_a...
>>
>> > >> >> >> >> --
>> > >> >> >> >> It is better to be wrong than to be vague.
>> > >> >> >> >>   — Freeman Dyson
>>
>> > >> >> >> --
>> > >> >> >> It is better to be wrong than to be vague.
>> > >> >> >>   — Freeman Dyson
>>
>> > >> >> --
>> > >> >> It is better to be wrong than to be vague.
>> > >> >>   — Freeman Dyson
>>
>> > >> --
>> > >> It is better to be wrong than to be vague.
>> > >>   — Freeman Dyson
>>
>> > --
>> > It is better to be wrong than to be vague.
>> >   — Freeman Dyson
> >
>



-- 
It is better to be wrong than to be vague.
  — Freeman Dyson
