Well, you'll never get truly parallel running of the callbacks: even
if each callback runs in the same thread as its urlfetch, each fetch
takes a different amount of time. Though, I'm not sure whether the
callbacks run in the core thread or not; that's where they'd be
running if you see them execute sequentially. I don't have access to
a machine to look at this until tonight, and I'm not sure even then
I'd have the time. However, if I were to look at it, I'd probably try
these two approaches...

Since you're checking thread hashes, you could check the hash of the
thread the urlfetch uses, and see if the callback thread hash matches.
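
A rough, untested sketch of that check (this assumes the megafetch
API David uses further down the thread, i.e. Fetcher.start(url,
callback) and Fetcher.wait(); everything else is just illustrative):

import logging
import threading

import megafetch

def check_callback_threads(urls):
    # Note the thread the fetches are started from.
    main_thread = threading.currentThread().getName()
    logging.info('fetches started from thread: %s', main_thread)

    def cb(url, result):
        # If this always matches main_thread, the callbacks are being
        # run serially on the core thread, not in the background.
        cb_thread = threading.currentThread().getName()
        logging.info('callback for %s ran on thread: %s', url, cb_thread)

    fetcher = megafetch.Fetcher()
    for url in urls:
        # Default argument pins url per iteration (late-binding gotcha).
        fetcher.start(url, lambda result, url=url: cb(url, result))
    fetcher.wait()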

You could also instrument the timing, something like this (rough
code sketch below):

Get a urlfetch-start timestamp
urlfetch
Get a urlfetch-complete timestamp
Get a callback-start timestamp

Compare the urlfetch-start timestamps to confirm the fetches all
start at the same time. Compare each urlfetch-complete timestamp to
the corresponding callback-start timestamp to see whether the
callback really does start as soon as its fetch ends.
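
An equally untested sketch of that timing check (same megafetch
assumptions as above; time.time() resolution is plenty for this):

import logging
import time

import megafetch

def time_fetches(urls):
    fetcher = megafetch.Fetcher()
    timings = {}

    def cb(url, result):
        # Record the moment the callback actually starts running.
        timings[url]['callback_start'] = time.time()

    for url in urls:
        timings[url] = {'fetch_start': time.time()}
        fetcher.start(url, lambda result, url=url: cb(url, result))

    fetcher.wait()

    base = min(t['fetch_start'] for t in timings.values())
    for url, t in sorted(timings.items()):
        logging.info('%s: fetch started +%.3fs, callback started +%.3fs',
                     url, t['fetch_start'] - base,
                     t['callback_start'] - base)

If the fetch-start offsets are all near zero but the callback starts
line up one after another, the callbacks are running serially.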

On Mar 18, 8:11 am, bFlood <bflood...@gmail.com> wrote:
> hey david, joe
>
> I've got the async datastore Get working, but I'm not sure the
> callbacks are being run on a background thread. They appear to be
> when you examine something like thread-local storage (the hashes are
> all unique), but if you insert just a simple time.sleep they appear
> to run serially. (Note: while I'm not completely new to async code,
> this is my first run with Python, so I'm not sure of the threading
> behavior of something like sleep or logging.debug.)
>
> I would like to be able to run some code just after the fetch of
> each entity; the hope is that this would run in parallel.
>
> any thoughts?
>
> cheers
> brian
>
> On Mar 18, 6:14 am, Joe Bowman <bowman.jos...@gmail.com> wrote:
>
> > Ah ha.. thanks David.
>
> > And for the views, if I really wanted to launch everything at once, I
> > could map my Boss, YouTube, Twitter, etc. pulls to their own URLs,
> > and use megafetch in my master view to pull those URLs all at once
> > too.
>
> > On Mar 18, 5:14 am, David Wilson <d...@botanicus.net> wrote:
>
> > > Hey Joe,
>
> > > With the gdata package you can do something like this instead:
>
> > > As usual, completely untested code, but looks about right..
>
> > > import logging
> > > import megafetch
> > > from gdata.youtube import YouTubeVideoFeedFromString
>
> > > def get_feeds_async(usernames):
> > >     fetcher = megafetch.Fetcher()
> > >     output = {}
>
> > >     def cb(username, result):
> > >         # megafetch hands the callback an Exception on failure.
> > >         if isinstance(result, Exception):
> > >             logging.error('could not fetch: %s', result)
> > >             content = None
> > >         else:
> > >             content = YouTubeVideoFeedFromString(result.content)
> > >         output[username] = content
>
> > >     for username in usernames:
> > >         url = 'http://gdata.youtube.com/feeds/api/users/%s/uploads' % \
> > >             (username,)
> > >         # Default arg pins username per iteration; otherwise every
> > >         # callback would see the loop's final username.
> > >         fetcher.start(url, lambda result, username=username:
> > >                       cb(username, result))
>
> > >     fetcher.wait()
> > >     return output
>
> > > feeds = get_feeds_async([ 'davemw', 'waverlyflams', 'googletechtalks',
> > >                           'TheOnion', 'winterelaxation' ])
>
> > > # feeds is now a mapping of usernames to YouTubeVideoFeed instances,
> > > # or None for any feed that could not be fetched.
>
> > > 2009/3/18 Joe Bowman <bowman.jos...@gmail.com>:
>
> > > > This may be a really dumb question, but.. I'm still learning so...
>
> > > > Is there a way to do something other than a direct api call
> > > > asynchronously? I'm writing a script that pulls from multiple sources,
> > > > sometimes with higher level calls that use urlfetch, such as gdata.
> > > > Since I'm attempting to pull from multiple sources, and sometimes
> > > > multiple urls from each source, I'm trying to figure out if it's
> > > > possible to run other methods at the same time.
>
> > > > For example, I want to pull a youtube entry for several different
> > > > authors. The youtube api doesn't allow multiple authors in a request
> > > > (I have an enhancement request in for that though), so I need to do a
> > > > yt_service.GetYouTubeVideoFeed() for each author, then splice them
> > > > together into one feed. As I'm also working with Boss, and eventually
> > > > Twitter, I'll have feeds to pull from those sources as well.
>
> > > > My current application layout is using appengine-patch to provide
> > > > django. I've set up a Boss and Youtube "model" with get methods that
> > > > handle getting the data. So I can do something similar to:
>
> > > > web_results = models.Boss.get(request.GET['term'], start=start)
> > > > news_results = models.Boss.get(request.GET['term'], vertical="news",
> > > > start=start)
> > > > youtube = models.Youtube.get(request.GET['term'], start=start)
>
> > > > Ideally, I'd like some of those models to be able to do asynchronous
> > > > tasks within their get function, and I'd also like to run the above
> > > > requests at the same time, which should really speed the request up.
>
> > > > On Mar 17, 9:20 am, Joe Bowman <bowman.jos...@gmail.com> wrote:
> > > >> Thanks,
>
> > > >> I'm going to give it a go for urlfetch calls for one project I'm
> > > >> working on this week.
>
> > > >> Not sure when I'd be able to include it in gaeutilities for cron and
> > > >> such, that project is currently lower on my priority list at the
> > > >> moment, but can't wait until I get a chance to play with it. Another
> > > >> idea I had for it is the ROTmodel (retry on timeout model) in the
> > > >> project, which could speed that process up.
>
> > > >> On Mar 17, 9:11 am, David Wilson <d...@botanicus.net> wrote:
>
> > > >> > 2009/3/16 Joe Bowman <bowman.jos...@gmail.com>:
>
> > > >> > > Wow, that's great. The SDK might be problematic for you, as it
> > > >> > > appears to be very single-threaded; I know for a fact it can't
> > > >> > > reply to requests to itself.
>
> > > >> > > Out of curiosity, are you still using base urlfetch, or is it your
> > > >> > > own creation? Once Google releases their scheduled-tasks
> > > >> > > functionality this will be less of an issue, but if your solution
> > > >> > > had the ability to fire off urlfetch calls and not wait for a
> > > >> > > response, it could be a perfect fit for the gaeutilities cron
> > > >> > > utility.
>
> > > >> > > Currently it grabs the list of tasks it's supposed to run on each
> > > >> > > request, sets a timestamp, runs one task, then compares now() to
> > > >> > > the timestamp; if the timedelta is more than 1 second, it stops
> > > >> > > running tasks and finishes the request (see the rough sketch
> > > >> > > below). Your project already seems perfect for running all the
> > > >> > > necessary tasks at once, and I believe the MIT License is
> > > >> > > compatible with the BSD license I've released gaeutilities under,
> > > >> > > so would you have any personal objection to me including it in
> > > >> > > gaeutilities at some point, with proper attribution of course?
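>
> > > >> > > Roughly, that loop looks like this (a sketch with hypothetical
> > > >> > > names, not the actual gaeutilities code):
>
> > > >> > > import datetime
>
> > > >> > > MAX_RUNTIME = datetime.timedelta(seconds=1)
>
> > > >> > > def run_cron_tasks(tasks):
> > > >> > >     # tasks: the callables scheduled to run on this request
> > > >> > >     started = datetime.datetime.now()
> > > >> > >     for task in tasks:
> > > >> > >         task()
> > > >> > >         # Stop after roughly a second so the real request can
> > > >> > >         # finish; remaining tasks wait for a later request.
> > > >> > >         if datetime.datetime.now() - started > MAX_RUNTIME:
> > > >> > >             break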
>
> > > >> > Sorry I missed this in the first reply - yeah work away! :)
>
> > > >> > David
>
> > > >> > > If you haven't seen that project, its URL is
> > > >> > > http://gaeutilities.appspot.com/
>
> > > >> > > On Mar 16, 11:03 am, David Wilson <d...@botanicus.net> wrote:
> > > >> > >> Joe,
>
> > > >> > >> I've only tested it in production. ;)
>
> > > >> > >> The code should work serially on the SDK, but I haven't tried yet.
>
> > > >> > >> David.
>
> > > >> > >> 2009/3/16 Joe Bowman <bowman.jos...@gmail.com>:
>
> > > >> > >> > Does the batch fetching work on live App Engine applications,
> > > >> > >> > or only on the SDK?
>
> > > >> > >> > On Mar 16, 10:19 am, David Wilson <d...@botanicus.net> wrote:
> > > >> > >> >> I have no idea how definitive this is, but it means that wall
> > > >> > >> >> clock time seems to be how CPU cost is measured. I guess this
> > > >> > >> >> makes sense for a few different reasons.
>
> > > >> > >> >> I found some internal function
> > > >> > >> >> "google3.apphosting.runtime._apphosting_runtime___python__apiproxy.get_request_cpu_usage"
> > > >> > >> >> with the docstring:
>
> > > >> > >> >>     Returns the number of megacycles used so far by this
> > > >> > >> >>     request. Does not include CPU used by API calls.
>
> > > >> > >> >> Calling it, then running time.sleep(5), then calling it again,
> > > >> > >> >> indicates thousands of megacycles used, yet in real terms the
> > > >> > >> >> CPU was probably doing nothing. I guess Datastore CPU, etc., is
> > > >> > >> >> added on top of this, but it seems to suggest that if you can
> > > >> > >> >> drastically reduce request time, quota usage should drop too.
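>
> > > >> > >> >> That is, roughly (a sketch only; whether the google3 module
> > > >> > >> >> imports cleanly like this is an assumption on my part):
>
> > > >> > >> >> import logging
> > > >> > >> >> import time
> > > >> > >> >> from google3.apphosting.runtime import \
> > > >> > >> >>     _apphosting_runtime___python__apiproxy as apiproxy
>
> > > >> > >> >> before = apiproxy.get_request_cpu_usage()
> > > >> > >> >> time.sleep(5)
> > > >> > >> >> after = apiproxy.get_request_cpu_usage()
> > > >> > >> >> # Thousands of megacycles get reported even though the CPU
> > > >> > >> >> # sat idle for the whole five seconds.
> > > >> > >> >> logging.info('megacycles for sleep(5): %d', after - before)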
>
> > > >> > >> >> I have yet to do any kind of rough measurement of Datastore
> > > >> > >> >> CPU, so I'm not sure how correct this all is.
>
> > > >> > >> >> David.
>
> > > >> > >> >>  - One of the guys on IRC suggested this means that per-request
> > > >> > >> >> cost is scaled during peak usage (and thus internal services
> > > >> > >> >> running slower).
>
> > > >> > >> >> 2009/3/16 peterk <peter.ke...@gmail.com>:
>
> > > >> > >> >> > A couple of questions re. CPU usage..
>
> > > >> > >> >> > "CPU time quota appears to be calculated based on literal 
> > > >> > >> >> > time"
>
> > > >> > >> >> > Can you clarify what you mean here? I presume each async
> > > >> > >> >> > request eats into your CPU budget. But you say:
>
> > > >> > >> >> > "since you can burn a whole lot more AppEngine CPU more 
> > > >> > >> >> > cheaply using
> > > >> > >> >> > the async api"
>
> > > >> > >> >> > Can you clarify how that's the case?
>
> > > >> > >> >> > I would guess that as long as you're being billed for the
> > > >> > >> >> > cpu-ms spent in your asynchronous calls, Google would let you
> > > >> > >> >> > hang yourself with them when it comes to billing.. :) so I
> > > >> > >> >> > presume they'd let you squeeze in as many as your original
> > > >> > >> >> > request, and its limit, will allow for?
>
> > > >> > >> >> > Thanks again.
>
> > > >> > >> >> > On Mar 16, 2:00 pm, David Wilson <d...@botanicus.net> wrote:
> > > >> > >> >> >> It's completely undocumented (at this stage, anyway), but
> > > >> > >> >> >> definitely seems to work. A few notes I've gathered:
>
> > > >> > >> >> >>  - CPU time quota appears to be calculated based on literal
> > > >> > >> >> >> time, rather than e.g. the UNIX concept of "time spent in
> > > >> > >> >> >> running state".
>
> > > >> > >> >> >>  - I can fetch 100 URLs in 1.3 seconds from a machine
> > > >> > >> >> >> colocated in Germany using the asynchronous API. I can't
> > > >> > >> >> >> begin to imagine how slow (and therefore expensive in
> > > >> > >> >> >> monetary terms) this would be using the standard API.
>
> > > >> > >> >> >>  - The user-specified callback function appears to be
> > > >> > >> >> >> invoked in a separate thread; the RPC isn't "complete" until
> > > >> > >> >> >> this callback completes. The callback thread is still
> > > >> > >> >> >> subject to the request deadline.
>
> > > >> > >> >> >>  - It's a standard interface, and seems to have no parallel
> > > >> > >> >> >> restrictions, at least for urlfetch and Datastore. However,
> > > >> > >> >> >> I imagine that it's possible restrictions may be placed here
>
> ...