[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-18 Thread David Wilson

Hey Joe,

With the gdata package you can do something like this instead:


As usual, completely untested code, but looks about right..


import logging

import megafetch
from gdata.youtube import YouTubeVideoFeedFromString


def get_feeds_async(usernames):
    fetcher = megafetch.Fetcher()
    output = {}

    def cb(username, result):
        # result is either an Exception or a urlfetch response object.
        if isinstance(result, Exception):
            logging.error('could not fetch %s: %s', username, result)
            content = None
        else:
            content = YouTubeVideoFeedFromString(result.content)
        output[username] = content

    for username in usernames:
        url = 'http://gdata.youtube.com/feeds/api/users/%s/uploads' % \
            (username,)
        # Bind username now; a bare "lambda result:" would see only the
        # loop variable's final value by the time the callback fires.
        fetcher.start(url, lambda result, username=username: cb(username, result))

    fetcher.wait()
    return output


feeds = get_feeds_async([ 'davemw', 'waverlyflams', 'googletechtalks',
  'TheOnion', 'winterelaxation' ])

# feeds is now a mapping of usernames to YouTubeVideoFeed instances,
# or None if the feed could not be fetched.



[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-18 Thread Joe Bowman

Ah ha.. thanks David.

And for the views, if I really wanted to launch everything at once, I
could map my boss, youtube, twitter, etc etc pulls to their own urls,
and use megafetch in my master view to pull those urls all at once
too.
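
A minimal sketch of that fan-out, assuming the megafetch.Fetcher
interface from David's example (start(url, callback) plus wait()); the
/_pull/... endpoints and the SOURCE_URLS mapping are hypothetical:

import megafetch

# Hypothetical internal endpoints, one per source; each handler does
# its own pull and renders the result.
SOURCE_URLS = {
    'boss': 'http://myapp.appspot.com/_pull/boss?term=foo',
    'youtube': 'http://myapp.appspot.com/_pull/youtube?term=foo',
    'twitter': 'http://myapp.appspot.com/_pull/twitter?term=foo',
}

def pull_all():
    fetcher = megafetch.Fetcher()
    results = {}
    for name, url in SOURCE_URLS.items():
        # Default argument binds name per iteration (see the lambda
        # note in David's example above).
        def cb(result, name=name):
            results[name] = None if isinstance(result, Exception) else result.content
        fetcher.start(url, cb)
    fetcher.wait()  # all fetches run concurrently; block until done
    return results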


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-18 Thread bFlood

hey david, joe

I've got the async datastore Get working, but I'm not sure the
callbacks are being run on a background thread. they appear to be when
you examine something like the thread local storage (hashes are all
unique), but then if you insert just a simple time.sleep they appear to
run serially. (note - while not completely new to async code, this is
my first run with python so I'm not sure of the threading behavior
of something like sleep or logging.debug)

I would like to be able to run some code just after the fetch for each
entity; the hope is that this would be run in parallel.

any thoughts?

cheers
brian
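
One way to probe this, assuming the same megafetch.Fetcher interface
as in David's example: sleep inside each callback and compare total
wall time. If the callbacks really overlap, five one-second sleeps
finish in about a second; if they're serialized, it's about five.

import logging
import threading
import time

import megafetch

def probe(result, n):
    logging.info('callback %d on thread %s', n,
                 threading.currentThread().getName())
    time.sleep(1)  # stand-in for per-entity work

fetcher = megafetch.Fetcher()
start = time.time()
for n in range(5):
    fetcher.start('http://example.com/', lambda result, n=n: probe(result, n))
fetcher.wait()
logging.info('all callbacks done in %.1fs', time.time() - start)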


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-17 Thread Joe Bowman

This may be a really dumb question, but.. I'm still learning so...

Is there a way to do something other than a direct api call
asynchronously? I'm writing a script that pulls from multiple sources,
sometimes with higher level calls that use urlfetch, such as gdata.
Since I'm attempting to pull from multiple sources, and sometimes
multiple urls from each source, I'm trying to figure out if it's
possible to run other methods at the same time.

For example, I want to pull a youtube entry for several different
authors. The youtube api doesn't allow multiple authors in a request
(I have an enhancement request in for that though), so I need to do a
yt_service.GetYouTubeVideoFeed() for each author, then splice them
together into one feed. As I'm also working with Boss, and eventually
Twitter, I'll have feeds to pull from those sources as well.

My current application layout is using appengine-patch to provide
django. I've set up a Boss and Youtube model with get methods that
handle getting the data. So I can do something similar to:

web_results = models.Boss.get(request.GET['term'], start=start)
news_results = models.Boss.get(request.GET['term'], vertical='news',
start=start)
youtube = models.Youtube.get(request.GET['term'], start=start)

Ideally, I'd like some of those models to be able to do asynchronous
tasks within their get function, and then also, I'd like to run the
above requests at the same time, which should really speed the request up.


On Mar 17, 9:20 am, Joe Bowman bowman.jos...@gmail.com wrote:
 Thanks,

 I'm going to give it a go for urlfetch calls for one project I'm
 working on this week.

 Not sure when I'd be able to include it in gaeutilities for cron and
 such, that project is currently lower on my priority list at the
 moment, but can't wait until I get a chance to play with it. Another
 idea I had for it is the ROTmodel (retry on timeout model) in the
 project, which could speed that process up.
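
For reference, the retry-on-timeout idea is a db.Model subclass whose
put() retries when the Datastore raises Timeout; a minimal sketch of
the concept (retry count and backoff are illustrative, not
gaeutilities' actual values):

import time

from google.appengine.ext import db

class ROTModel(db.Model):
    """Retry-on-timeout model: put() retries on datastore timeouts."""

    def put(self, retries=3):
        for attempt in range(retries):
            try:
                return db.Model.put(self)
            except db.Timeout:
                if attempt == retries - 1:
                    raise  # out of retries; let the caller see the timeout
                time.sleep(0.1 * (attempt + 1))  # brief backoff, then retry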

 On Mar 17, 9:11 am, David Wilson d...@botanicus.net wrote:

  Sorry I missed this in the first reply - yeah work away! :)

  David

[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread peterk

Very neat.. Thank you.

Just to clarify, can we use this for all API calls? Datastore too? I
didn't look very closely at the async proxy in pubsubhubbub..

Asynchronous calls available on all apis might give a lot to chew
on.. :) It's been a while since I've worked with async function calls
or threading, might have to dig up some old notes to see where I could
extract gains from it in my app. Some common cases might be worth the
community documenting for all to benefit from, too.

On Mar 16, 1:26 pm, David Wilson d...@botanicus.net wrote:
 I've created a Google Code project to contain some batch utilities I'm
 working on, based on async_apiproxy.py from pubsubhubbub[0]. The
 project currently contains just a modified async_apiproxy.py that
 doesn't require dummy google3 modules on the local machine, and a
 megafetch.py, for batch-fetching URLs.

    http://code.google.com/p/appengine-async-tools/

 David

 [0]http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/async_a...



[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread David Wilson

It's completely undocumented (at this stage, anyway), but definitely
seems to work. A few notes I've gathered:

 - CPU time quota appears to be calculated based on literal time,
rather than e.g. the UNIX concept of time spent in running state.

 - I can fetch 100 URLs in 1.3 seconds from a machine colocated in
Germany using the asynchronous API. I can't begin to imagine how slow
(and therefore expensive in monetary terms) this would be using the
standard API.

 - The user-specified callback function appears to be invoked in a
separate thread; the RPC isn't complete until this callback
completes. The callback thread is still subject to the request
deadline.

 - It's a standard interface, and seems to have no parallel
restrictions, at least for urlfetch and Datastore. However, I imagine
it's possible that restrictions may be placed here at some later
stage, since you can burn a whole lot more AppEngine CPU more cheaply
using the async api.

 - It's standard only insomuch as you have to fiddle with
AppEngine-internal protocolbuffer definitions for each service type.
This mostly means copy-pasting the standard sync call code from the
SDK, and hacking it to use pubsubhubbub's proxy code.

Per the last point, you might be better off waiting for an officially
sanctioned API for doing this, though I doubt the protocolbuffer
definitions change all that often.
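
To make those last two notes concrete, here is a hedged sketch of
what such a hand-rolled async urlfetch call looks like. It assumes
pubsubhubbub's AsyncAPIProxy with a start_call(package, call,
request_pb, response_pb, callback) method, a callback invoked with the
response proto and any exception, and accessor names in the SDK's
generated-proto style:

import logging

from google.appengine.api import urlfetch_service_pb

import async_apiproxy  # the modified pubsubhubbub proxy module

proxy = async_apiproxy.AsyncAPIProxy()

def fetch_async(url, callback):
    # Build the same protocol buffers the synchronous urlfetch.fetch() uses.
    request = urlfetch_service_pb.URLFetchRequest()
    request.set_url(url)
    request.set_method(urlfetch_service_pb.URLFetchRequest.GET)
    response = urlfetch_service_pb.URLFetchResponse()
    proxy.start_call('urlfetch', 'Fetch', request, response, callback)

def on_done(response, exception):
    if exception is None:
        logging.info('%d: %d bytes', response.statuscode(),
                     len(response.content()))

fetch_async('http://example.com/', on_done)
proxy.wait()  # drain outstanding RPCs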

Thanks to Brett Slatkin & co. for doing the digging required to get
the async stuff working! :)


David.

-- 
It is better to be wrong than to be vague.
  — Freeman Dyson




[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread peterk

A couple of questions re. CPU usage..

"CPU time quota appears to be calculated based on literal time"

Can you clarify what you mean here? I presume each async request eats
into your CPU budget. But you say:

"since you can burn a whole lot more AppEngine CPU more cheaply using
the async api"

Can you clarify how that's the case?

I would guess as long as you're being billed for the cpu-ms spent in
your asynchronous calls, Google would let you hang yourself with them
when it comes to billing.. :) so I presume they'd let you squeeze in
as many as your original request, and its limit, will allow for?

Thanks again.





[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread bFlood

oh my, this is working now?!? I just assumed it would only be
available from the next build. great work david!

I agree on waiting for the official release but it's certainly
something that we can test with right now in preparation for the new
release.

thanks for digging this out (and thanks to Brett Slatkin as well)

cheers
brian




[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread David Wilson

I have no idea how definitive this is, but it literally means that
wall-clock time seems to be how CPU cost is measured. I guess this
makes sense for a few different reasons.

I found some internal function
google3.apphosting.runtime._apphosting_runtime___python__apiproxy.get_request_cpu_usage
with the docstring:

    Returns the number of megacycles used so far by this request.
    Does not include CPU used by API calls.

Calling it, then running time.sleep(5), then calling it again,
indicates thousands of megacycles used, yet in real terms the CPU was
probably doing nothing. I guess Datastore CPU, etc., is added on top
of this, but it seems to suggest to me that if you can drastically
reduce request time, quota usage should drop too.
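
A sketch of that experiment, assuming the internal module named above
is importable under that dotted path inside a production request (it's
internal, so it may change or vanish without notice):

import logging
import time

from google3.apphosting.runtime import \
    _apphosting_runtime___python__apiproxy as rt

before = rt.get_request_cpu_usage()
time.sleep(5)  # no real CPU work, just wall-clock time
after = rt.get_request_cpu_usage()
logging.info('megacycles charged while sleeping: %d', after - before)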

I have yet to do any kind of rough measurements of Datastore CPU, so
I'm not sure how correct this all is.


David.

 - One of the guys on IRC suggested this means that per-request cost
is scaled during peak usage (and thus internal services running
slower).




[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread Joe Bowman

Does the batch fetching work on live appengine applications, or
only on the SDK?


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread David Wilson

Joe,

I've only tested it in production. ;)

The code should work serially on the SDK, but I haven't tried yet.


David.



[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread Joe Bowman

Wow, that's great. The SDK might be problematic for you, as it appears
to be very single threaded; I know for a fact it can't reply to
requests to itself.

Out of curiosity, are you still using base urlfetch, or is it your own
creation? While it will be less of an issue once Google releases their
scheduled tasks functionality, if your solution had the ability to
fire off urlfetch calls and not wait for a response, it could be a
perfect fit for the gaeutilities cron utility.

Currently it grabs a list of tasks it's supposed to run on request,
sets a timestamp, runs one, then compares now() to the timestamp; if
the timedelta is more than 1 second, it stops running tasks and
finishes the request. It already appears your project would be perfect
for running all necessary tasks at once, and I believe the MIT License
is compatible with the BSD license I've released gaeutilities under,
so would you have any personal objection to me including it in
gaeutilities at some point, with proper attribution of course?
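
A rough sketch of that time-boxed loop (names hypothetical, not
gaeutilities' actual code):

import datetime

def run_cron_tasks(tasks):
    # Run queued tasks until about one second of wall-clock time has
    # elapsed, then stop so the request itself can finish quickly;
    # leftover tasks wait for the next request to trigger the cron.
    started = datetime.datetime.now()
    for task in tasks:
        task.run()  # hypothetical task interface
        if datetime.datetime.now() - started > datetime.timedelta(seconds=1):
            break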

If you haven't seen that project, its url is http://gaeutilities.appspot.com/


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread bFlood


@joe - fire/forget - you can just skip the fetcher.wait() call (which
calls AsyncAPIProxy.wait). I'm not sure if you would need a valid
callback, but even if you did it could be a simple stub that does
nothing.
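
A minimal sketch of that, again assuming the megafetch.Fetcher
interface from David's example (and note his caveat in the next
message: AppEngine still holds the response until the outstanding
RPCs end):

import megafetch

def fire_and_forget(urls):
    fetcher = megafetch.Fetcher()
    for url in urls:
        # Stub callback: we don't care about the response.
        fetcher.start(url, lambda result: None)
    # No fetcher.wait() here; handler code continues immediately.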

@david - have you made this work with datastore calls yet? having some
issues trying to figure out how to set pbrequest/pbresponse variables

cheers
brian



[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread David Wilson

I forgot to mention, AppEngine does not close the request until all
asynchronous requests have ended. This means it's not truly "fire and
forget". Regardless of whether you're waiting for a response or not,
if a request is in progress, the HTTP response body is not returned to
the client.

I created a simple function this morning to call datastore_v3.Delete
on a set of key objects; it appeared to work, but I didn't test beyond
ensuring the callback didn't receive an exception. Pretty untested
code here: http://pastie.org/417496.
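
The gist of it, from memory (untested; the PB classes are the
SDK-internal datastore_pb definitions, and start_call() stands in for
whatever method your async proxy wrapper actually exposes):

from google.appengine.datastore import datastore_pb

def delete_async(proxy, keys, callback):
    # Build the same PBs the synchronous datastore code builds for us.
    request = datastore_pb.DeleteRequest()
    for key in keys:
        # Key._ToPb() is (I believe) the SDK-internal conversion of a
        # datastore Key to a Reference PB.
        request.add_key().CopyFrom(key._ToPb())
    response = datastore_pb.DeleteResponse()
    # Hand both PBs to the async proxy; the callback fires on completion.
    proxy.start_call('datastore_v3', 'Delete', request, response, callback)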

For simple uses, calling the Datastore asynchronously probably isn't
all that useful anyway, since unlike urlfetch, you can already
minimize latency by making batch calls at the start/end of your
request for all the keys you want to load/save. It's possibly useful
for concurrently committing a bunch of different transactions, but the
code for this is less trivial than the urlfetch case. Probably best to
see what the AppEngine team themselves provide for this. ;)


David.

2009/3/16 bFlood bflood...@gmail.com:


 @joe - fire/forget - you can just skip the fetcher.wait() call (which
 calls AsyncAPIProxy.wait). I'm not sure if you would need a valid
 callback, but even if you did it could be a simple stub that does
 nothing.
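
 something like this, I'd guess (untested; megafetch is david's module
 from earlier in the thread):

 import megafetch

 fetcher = megafetch.Fetcher()
 fetcher.start('http://example.com/ping', lambda result: None)  # stub
 # no fetcher.wait() -- though as david notes above, the request still
 # won't actually close until the RPC finishes.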

 @david - have you made this work with datastore calls yet? having some
 issues trying to figure out how to set pbrequest/pbresponse variables

 cheers
 brian

[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread Joe Bowman

I imagine keeping the request open until everything is done isn't
going to go away any time soon; it's how HTTP responses work, and the
scheduled tasks on the roadmap would be better suited to providing
support for that. I also agree that the batch put and get
functionality is, for the most part, already there.

My experience from mass delete scripts has been that delete is
extremely heavy, and before the runtime length was extended, I came up
with 75 as the safe number of entities to delete in a request without
encountering timeouts, for the most part. I ended up using javascript
with a simple protocol (responses of "there's more" and "all done") in
order to delete 10k+ objects at a time. During that time I did notice
that repeated writing to the datastore (or deleting, in my case) also
caused other errors which looked like I was being throttled, so that's
something else you may encounter if you continue to work on
asynchronous datastore calls.
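
The server side of that protocol looked roughly like this (names made
up, not the actual script):

from google.appengine.ext import db

SAFE_BATCH = 75  # the per-request delete count that avoided timeouts

def delete_some(model_class):
    entities = model_class.all().fetch(SAFE_BATCH)
    db.delete(entities)
    # The javascript client keeps calling back until it sees "all done".
    if len(entities) < SAFE_BATCH:
        return 'all done'
    return "there's more"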


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread bFlood

thanks david.

agreed on datastore, except that unlike the current batch calls, you
might be able to execute code concurrently on each response and then
wait for all the workers' results. to me, and I could be wrong, even a
no-op datastore request could serve as a poor man's worker thread.
I'll see if I can get it working on our stuff and report back (did you
happen to notice if all the threads were started on the same machine?)
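
roughly what I have in mind (very hypothetical -- start_call() is
whatever the async proxy wrapper exposes, and I haven't checked that
an empty Get is even a legal request):

from google.appengine.datastore import datastore_pb

def start_worker(proxy, work, item):
    # an empty, no-op Get, purely to get a callback in its own thread
    request = datastore_pb.GetRequest()
    response = datastore_pb.GetResponse()
    proxy.start_call('datastore_v3', 'Get', request, response,
                     lambda result, item=item: work(item))

def process_concurrently(proxy, items, work):
    for item in items:
        start_worker(proxy, work, item)
    proxy.wait()  # block until every callback has finished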

regardless, it will just be testing for right now. I'm sure the GAE
team has their own ideas about what's allowed with async access.

cheers and thanks again
brian