[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread peterk

Very neat.. Thank you.

Just to clarify, can we use this for all API calls? Datastore too? I
didn't look very closely at the async proxy in pubsubhubbub..

Asynchronous calls available on all APIs might give a lot to chew
on.. :) It's been a while since I've worked with async function calls
or threading; I might have to dig up some old notes to see where I
could extract gains from it in my app. Some common cases might be
worth the community documenting for all to benefit from, too.

On Mar 16, 1:26 pm, David Wilson  wrote:
> I've created a Google Code project to contain some batch utilities I'm
> working on, based on async_apiproxy.py from pubsubhubbub[0]. The
> project currently contains just a modified async_apiproxy.py that
> doesn't require dummy google3 modules on the local machine, and a
> megafetch.py, for batch-fetching URLs.
>
>    http://code.google.com/p/appengine-async-tools/
>
> David
>
> [0]http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/async_a...
>
> --
> It is better to be wrong than to be vague.
>   — Freeman Dyson



[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread David Wilson

It's completely undocumented (at this stage, anyway), but definitely
seems to work. A few notes I've gathered:

 - CPU time quota appears to be calculated based on literal time,
rather than e.g. the UNIX concept of "time spent in running state".

 - I can fetch 100 URLs in 1.3 seconds from a machine colocated in
Germany using the asynchronous API. I can't begin to imagine how slow
(and therefore expensive in monetary terms) this would be using the
standard API.

 - The user-specified callback function appears to be invoked in a
separate thread; the RPC isn't "complete" until this callback
completes. The callback thread is still subject to the request
deadline.

 - It's a standard interface, and seems to have no parallel
restrictions, at least for urlfetch and Datastore. However, I imagine
restrictions may be placed here at some later stage, since you can
burn a whole lot more AppEngine CPU more cheaply using the async API.

 - It's "standard" only insomuch as you have to fiddle with
AppEngine-internal protocolbuffer definitions for each service type.
This mostly means copy-pasting the standard sync call code from the
SDK, and hacking it to use pubsubhubub's proxy code.
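
For urlfetch, the hacked-up version ends up looking roughly like the
sketch below. Treat it as a sketch only: the start_call()/wait()
signatures and the callback arguments are what pubsubhubbub's
async_apiproxy.py appears to use; check that file for the exact
details.

    # Rough sketch only; start_call()/wait() and the callback signature
    # are assumed from pubsubhubbub's async_apiproxy.py, untested here.
    from google.appengine.api import urlfetch_service_pb
    import async_apiproxy

    def fetch_parallel(urls):
        proxy = async_apiproxy.AsyncAPIProxy()
        results = {}

        def make_callback(url):
            def callback(response, exception):
                # Runs in the proxy's callback thread as each RPC completes.
                results[url] = exception or response.content()
            return callback

        for url in urls:
            request = urlfetch_service_pb.URLFetchRequest()
            request.set_url(url)
            request.set_method(urlfetch_service_pb.URLFetchRequest.GET)
            response = urlfetch_service_pb.URLFetchResponse()
            proxy.start_call('urlfetch', 'Fetch', request, response,
                             make_callback(url))

        proxy.wait()  # block until every outstanding RPC has finished
        return results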

Per the last point, you might be better off waiting for an officially
sanctioned API for doing this, though I doubt the protocol buffer
definitions change all that often.

Thanks to Brett Slatkin & co. for doing the digging required to get
the async stuff working! :)


David.




-- 
It is better to be wrong than to be vague.
  — Freeman Dyson




[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread peterk

A couple of questions re. CPU usage..

"CPU time quota appears to be calculated based on literal time"

Can you clarify what you mean here? I presume each async request eats
into your CPU budget. But you say:

"since you can burn a whole lot more AppEngine CPU more cheaply using
the async api"

Can you clarify how that's the case?

I would guess that as long as you're being billed for the cpu-ms spent
in your asynchronous calls, Google would let you hang yourself with
them when it comes to billing.. :) So I presume they'd let you squeeze
in as many as your original request, and its limit, will allow for?

Thanks again.


On Mar 16, 2:00 pm, David Wilson  wrote:
> It's completely undocumented (at this stage, anyway), but definitely
> seems to work. A few notes I've come gathered:
>
>  - CPU time quota appears to be calculated based on literal time,
> rather than e.g. the UNIX concept of "time spent in running state".
>
>  - I can fetch 100 URLs in 1.3 seconds from a machine colocated in
> Germany using the asynchronous API. I can't begin to imagine how slow
> (and therefore expensive in monetary terms) this would be using the
> standard API.
>
>  - The user-specified callback function appears to be invoked in a
> separate thread; the RPC isn't "complete" until this callback
> completes. The callback thread is still subject to the request
> deadline.
>
>  - It's a standard interface, and seems to have no parallel
> restrictions at least for urlfetch and Datastore. However, I imagine
> that it's possible restrictions may be placed here at some later
> stage, since you can burn a whole lot more AppEngine CPU more cheaply
> using the async api.
>
>  - It's "standard" only insomuch as you have to fiddle with
> AppEngine-internal protocolbuffer definitions for each service type.
> This mostly means copy-pasting the standard sync call code from the
> SDK, and hacking it to use pubsubhubub's proxy code.
>
> Per the last point, you might be better waiting for an officially
> sanctioned API for doing this, albeit I doubt the protocolbuffer
> definitions change all that often.
>
> Thanks for Brett Slatkin & co. for doing the digging required to get
> the async stuff working! :)
>
> David.
>
> 2009/3/16 peterk :
>
>
>
>
>
> > Very neat.. Thank you.
>
> > Just to clarify, can we use this for all API calls? Datastore too? I
> > didn't look very closely at the async proxy in pubsubhubub..
>
> > Asynchronous calls available on all apis might give a lot to chew
> > on.. :) It's been a while since I've worked with async function calls
> > or threading, might have to dig up some old notes to see where I could
> > extract gains from it in my app. Some common cases might be worth the
> > community documenting for all to benefit from, too.
>
> > On Mar 16, 1:26 pm, David Wilson  wrote:
> >> I've created a Google Code project to contain some batch utilities I'm
> >> working on, based on async_apiproxy.py from pubsubhubbub[0]. The
> >> project currently contains just a modified async_apiproxy.py that
> >> doesn't require dummy google3 modules on the local machine, and a
> >> megafetch.py, for batch-fetching URLs.
>
> >>    http://code.google.com/p/appengine-async-tools/
>
> >> David
>
> >> [0]http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/async_a...
>
> >> --
> >> It is better to be wrong than to be vague.
> >>   — Freeman Dyson
>
> --
> It is better to be wrong than to be vague.
>   — Freeman Dyson
--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en
-~--~~~~--~~--~--~---



[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread bFlood

oh my, this is working now?!? I just assumed it would only be
available in the next build. great work david!

I agree on waiting for the "official" release, but it's certainly
something we can test with right now in preparation for the new
release.

thanks for digging this out (and thanks to Brett Slatkin as well)

cheers
brian




[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread David Wilson

I have no idea how definitive this is, but by "literal time" I mean
that wall clock time seems to be how CPU cost is measured. I guess
this makes sense for a few different reasons.

I found some internal function
"google3.apphosting.runtime._apphosting_runtime___python__apiproxy.get_request_cpu_usage"
with the docstring:

    Returns the number of megacycles used so far by this request.
    Does not include CPU used by API calls.

Calling it, then running time.sleep(5), then calling it again,
indicates thousands of megacycles used, yet in real terms the CPU was
probably doing nothing. I guess Datastore CPU, etc., is added on top
of this, but it seems to suggest to me that if you can drastically
reduce request time, quota usage should drop too.
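
Concretely, the experiment was along these lines (illustrative only;
this pokes an undocumented AppEngine-internal module, so the import
path and behaviour could change at any time):

    # Illustrative only; the module below is an AppEngine internal.
    import time
    from google3.apphosting.runtime import \
        _apphosting_runtime___python__apiproxy as runtime_apiproxy

    before = runtime_apiproxy.get_request_cpu_usage()
    time.sleep(5)  # burns no real CPU at all
    after = runtime_apiproxy.get_request_cpu_usage()
    # (after - before) still reports thousands of megacycles, which is
    # what suggests the quota clock is wall time, not running-state time.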

I have yet to do any kind of rough measurements of Datastore CPU, so
I'm not sure how correct this all is.

 - One of the guys on IRC suggested this means that per-request cost
effectively scales up during peak usage, when internal services are
running slower.


David.




-- 
It is better to be wrong than to be vague.
  — Freeman Dyson


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread Joe Bowman

Does the batch fetching work on live appengine applications, or only
on the SDK?


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread David Wilson

Joe,

I've only tested it in production. ;)

The code should work serially on the SDK, but I haven't tried yet.


David.


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread Joe Bowman

Wow, that's great. The SDK might be problematic for you, as it appears
to be very single-threaded; I know for a fact it can't reply to
requests to itself.

Out of curiosity, are you still using base urlfetch, or is it your own
creation? When Google releases their scheduled tasks functionality
this will be less of an issue, but if your solution had the ability to
fire off urlfetch calls and not wait for a response, it could be a
perfect fit for the gaeutilities cron utility.

Currently it grabs a list of tasks it's supposed to run on request,
sets a timestamp, runs one, then compares now() to the timestamp; if
the timedelta is more than 1 second, it stops running tasks and
finishes the request. It appears your project would be perfect for
running all necessary tasks at once, and the MIT License is, I
believe, compatible with the BSD license I've released gaeutilities
under, so would you have any personal objection to me including it in
gaeutilities at some point, with proper attribution of course?
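
Roughly, the loop is (simplified from what the utility actually does):

    # Simplified sketch of the one-second time budget described above;
    # not the actual gaeutilities source.
    import datetime

    def run_cron_tasks(tasks):
        started = datetime.datetime.now()
        for task in tasks:
            task.run()
            if datetime.datetime.now() - started > \
                    datetime.timedelta(seconds=1):
                break  # over budget; the rest wait for the next request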

If you haven't seen that project, its url is http://gaeutilities.appspot.com/


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread bFlood


@joe - fire/forget - you can just skip the fetcher.wait() call (which
calls AsyncAPIProxy.wait). I'm not sure if you would need a valid
callback, but even if you did it could be a simple stub that does
nothing.
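
e.g. something like this (untested, and assuming the same start_call
signature megafetch.py uses):

    # Untested sketch; assumes start_call(service, method, request_pb,
    # response_pb, callback) as in pubsubhubbub's async_apiproxy.py.
    def ignore_result(response, exception):
        pass  # no-op callback; we never look at the result

    proxy.start_call('urlfetch', 'Fetch', request, response, ignore_result)
    # ...and simply never call proxy.wait() before finishing the request.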

@david - have you made this work with datastore calls yet? I'm having
some issues trying to figure out how to set the pbrequest/pbresponse
variables.

cheers
brian



[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread David Wilson

I forgot to mention: AppEngine does not close the request until all
asynchronous requests have ended, so it's not truly "fire and
forget". Regardless of whether you're waiting for a response or not,
if an RPC is still in progress, the HTTP response body is not returned
to the client.

I created a simple function this morning to call datastore_v3.Delete
on a set of key objects; it appeared to work, but I didn't test beyond
ensuring the callback didn't receive an exception. Pretty untested
code here: .
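
The gist of it is below, with the same caveats: barely tested, and the
protocol buffer plumbing is AppEngine-internal, so treat it as a
sketch.

    # Barely tested sketch; datastore_pb and the _Key__reference
    # attribute are AppEngine internals and could change without notice.
    from google.appengine.datastore import datastore_pb

    def delete_async(proxy, keys, callback):
        request = datastore_pb.DeleteRequest()
        for key in keys:
            # db.Key wraps an entity_pb.Reference; copy it into the request.
            request.add_key().CopyFrom(key._Key__reference)
        response = datastore_pb.DeleteResponse()
        proxy.start_call('datastore_v3', 'Delete', request, response,
                         callback)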

For simple uses, calling the Datastore asynchronously probably isn't
all that useful anyway, since unlike urlfetch, you can already
minimize latency by making batch calls at the start/end of your
request for all the keys you want to load/save. It's possibly useful
for concurrently committing a bunch of different transactions, but the
code for that is less trivial than the urlfetch case. Probably best to
see what the AppEngine team themselves provide for this. ;)


David.


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread Joe Bowman

I imagine keeping the request open until everything is done isn't
going to go away any time soon; it's how HTTP responses work, and the
scheduled tasks on the roadmap would be better suited to providing
support for that. I also agree that the batch put and get
functionality is, for the most part, already there.

My experience from mass delete scripts has been that delete is
extremely heavy; before the runtime length was extended, I came up
with 75 as the safe number of entities to delete in a request without
encountering timeouts for the most part. I ended up using javascript
with a simple protocol (responses of "there's more" and "all done") to
delete 10k+ objects at a time. During that time I noticed that
repeated writing to the datastore (or deleting, in my case) also
caused other errors that looked like throttling, so that's something
else you may encounter if you continue to work on asynchronous
datastore calls.
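
The handler side of that protocol was basically (simplified, not the
actual script):

    # Simplified version of the delete-in-chunks pattern described above.
    from google.appengine.ext import db

    DELETE_BATCH = 75  # empirically safe per-request batch size

    def delete_some(model_class):
        entities = model_class.all().fetch(DELETE_BATCH)
        db.delete(entities)
        # The javascript side keeps calling until it sees "all done".
        return "there's more" if len(entities) == DELETE_BATCH else "all done"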


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-16 Thread bFlood

thanks david.

agreed on datastore, except that unlike the current batch calls, you
might be able to execute code concurrently on each response and then
wait for all the workers' results. to me, and I could be wrong, even a
no-op datastore request could serve as a poor man's worker thread.
I'll see if I can get it working on our stuff and report back (did you
happen to notice if all the threads were started on the same machine?)

regardless, it will just be testing for right now. I'm sure the GAE
team has their own ideas about what's allowed with async access.

cheers and thanks again
brian






[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-17 Thread David Wilson

2009/3/16 Joe Bowman :
>
> [...] so would you have any personal objection to me including it in
> gaeutilities at some point, with proper attribution of course?

Sorry I missed this in the first reply - yeah work away! :)


David


[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-17 Thread Joe Bowman

Thanks,

I'm going to give it a go for urlfetch calls for one project I'm
working on this week.

Not sure when I'd be able to include it in gaeutilities for cron and
such; that project is currently lower on my priority list, but I can't
wait until I get a chance to play with it. Another
idea I had for it is the ROTmodel (retry on timeout model) in the
project, which could speed that process up.
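
For the cron case, a rough sketch of how that might look with the
Fetcher interface from megafetch.py - completely untested, and the
task URLs here are hypothetical:

import logging

import megafetch


def run_tasks_async(task_urls):
    # Fire off every pending task URL at once, instead of running one
    # task per second until the time budget runs out.
    fetcher = megafetch.Fetcher()

    def cb(url, result):
        if isinstance(result, Exception):
            logging.error('task %s failed: %s', url, result)

    for url in task_urls:
        # Default argument pins each URL to its own callback.
        fetcher.start(url, lambda result, url=url: cb(url, result))

    fetcher.wait()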

On Mar 17, 9:11 am, David Wilson  wrote:
> 2009/3/16 Joe Bowman :
>
>
>
>
>
> > Wow that's great. The SDK might be problematic for you, as it appears
> > to be very single threaded, I know for a fact it can't reply to
> > requests to itself.
>
> > Out of curiosity, are you still using base urlfetch, or is it your own
> > creation? While it will be less of an issue once Google releases
> > their scheduled tasks functionality, if your solution had the
> > ability to fire off urlfetch calls and not wait for a response, it
> > could be a perfect fit for the gaeutilities cron utility.
>
> > Currently it grabs a list of tasks it's supposed to run on request,
> > sets a timestamp, runs one, then compares now() to the timestamp,
> > and if the timedelta is more than 1 second, stops running tasks and
> > finishes the request. It already appears your project would be
> > perfect for running all necessary tasks at once, and I believe the
> > MIT License is compatible with the BSD license I've released
> > gaeutilities under, so would you have any personal objection to me
> > including it in gaeutilities at some point, with proper attribution
> > of course?
>
> Sorry I missed this in the first reply - yeah work away! :)
>
> David
>
>
>
>
>
> > If you haven't seen that project, its URL is http://gaeutilities.appspot.com/
>
> > On Mar 16, 11:03 am, David Wilson  wrote:
> >> Joe,
>
> >> I've only tested it in production. ;)
>
> >> The code should work serially on the SDK, but I haven't tried yet.
>
> >> David.
>
> >> 2009/3/16 Joe Bowman :
>
> >> > Does the batch fetching work on live appengine applications, or
> >> > only on the SDK?
>
> >> > On Mar 16, 10:19 am, David Wilson  wrote:
> >> >> I have no idea how definitive this is, but it literally means
> >> >> that wall-clock time seems to be how CPU cost is measured. I
> >> >> guess this makes sense for a few different reasons.
>
> >> >> I found some internal function
> >> >> "google3.apphosting.runtime._apphosting_runtime___python__apiproxy.get_request_cpu_usage"
> >> >> with the docstring:
>
> >> >>     Returns the number of megacycles used so far by this request.
> >> >>     Does not include CPU used by API calls.
>
> >> >> Calling it, then running time.sleep(5), then calling it again,
> >> >> indicates thousands of megacycles used, yet in real terms the CPU was
> >> >> probably doing nothing. I guess Datastore CPU, etc., is added on top
> >> >> of this, but it seems to suggest to me that if you can drastically
> >> >> reduce request time, quota usage should drop too.
>
> >> >> I have yet to do any kind of rough measurements of Datastore CPU, so
> >> >> I'm not sure how correct this all is.
>
> >> >> David.
>
> >> >>  - One of the guys on IRC suggested this means that per-request cost
> >> >> is scaled during peak usage (and thus internal services running
> >> >> slower).
>
> >> >> 2009/3/16 peterk :
>
> >> >> > A couple of questions re. CPU usage..
>
> >> >> > "CPU time quota appears to be calculated based on literal time"
>
> >> >> > Can you clarify what you mean here? I presume each async request eats
> >> >> > into your CPU budget. But you say:
>
> >> >> > "since you can burn a whole lot more AppEngine CPU more cheaply using
> >> >> > the async api"
>
> >> >> > Can you clarify how that's the case?
>
> >> >> > I would guess as long as you're being billed for the cpu-ms spent in
> >> >> > your asynchronous calls, Google would let you hang yourself with them
> >> >> > when it comes to billing.. :) so I presume they'd let you squeeze in
> >> >> > as many as your original request, and its limit, will allow for?
>
> >> >> > Thanks again.
>
> >> >> > On Mar 16, 2:00 pm, David Wilson  wrote:
> >> >> >> It's completely undocumented (at this stage, anyway), but definitely
> >> >> >> seems to work. A few notes I've gathered:
>
> >> >> >>  - CPU time quota appears to be calculated based on literal time,
> >> >> >> rather than e.g. the UNIX concept of "time spent in running state".
>
> >> >> >>  - I can fetch 100 URLs in 1.3 seconds from a machine colocated in
> >> >> >> Germany using the asynchronous API. I can't begin to imagine how slow
> >> >> >> (and therefore expensive in monetary terms) this would be using the
> >> >> >> standard API.
>
> >> >> >>  - The user-specified callback function appears to be invoked in a
> >> >> >> separate thread; the RPC isn't "complete" until this callback
> >> >> >> completes. The callback thread is still subject to the request
> >> >> >> deadline.
>
> >> >> >>  - It's a standard interface, and seems to have no parallel
> >> >> >> restrictions 

[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-17 Thread Joe Bowman

This may be a really dumb question, but.. I'm still learning so...

Is there a way to do something other than a direct api call
asynchronously? I'm writing a script that pulls from multiple sources,
sometimes with higher level calls that use urlfetch, such as gdata.
Since I'm attempting to pull from multiple sources, and sometimes
multiple urls from each source, I'm trying to figure out if it's
possible to run other methods at the same time.

For example, I want to pull a youtube entry for several different
authors. The youtube api doesn't allow multiple authors in a request
(I have an enhancement request in for that though), so I need to do a
yt_service.GetYouTubeVideoFeed() for each author, then splice them
together into one feed. As I'm also working with Boss, and eventually
Twitter, I'll have feeds to pull from those sources as well.

My current application layout is using appengine-patch to provide
django. I've set up a Boss and Youtube "model" with get methods that
handle getting the data. So I can do something similar to:

web_results = models.Boss.get(request.GET['term'], start=start)
news_results = models.Boss.get(request.GET['term'], vertical="news",
start=start)
youtube = models.Youtube.get(request.GET['term'], start=start)

Ideally, I'd like some of those models to be able to do asynchronous
tasks within their get function, and then also, I'd like to run the
above requests at the same time, which should really speed the request
up.


On Mar 17, 9:20 am, Joe Bowman  wrote:
> Thanks,
>
> I'm going to give it a go for urlfetch calls for one project I'm
> working on this week.
>
> Not sure when I'd be able to include it in gaeutilities for cron and
> such; that project is currently lower on my priority list, but I can't
> wait until I get a chance to play with it. Another
> idea I had for it is the ROTmodel (retry on timeout model) in the
> project, which could speed that process up.
>
> On Mar 17, 9:11 am, David Wilson  wrote:
>
> > 2009/3/16 Joe Bowman :
>
> > > Wow that's great. The SDK might be problematic for you, as it appears
> > > to be very single threaded, I know for a fact it can't reply to
> > > requests to itself.
>
> > > Out of curiosity, are you still using base urlfetch, or is it your own
> > > creation? While it will be less of an issue once Google releases
> > > their scheduled tasks functionality, if your solution had the
> > > ability to fire off urlfetch calls and not wait for a response, it
> > > could be a perfect fit for the gaeutilities cron utility.
>
> > > Currently it grabs a list of tasks it's supposed to run on request,
> > > sets a timestamp, runs one, then compares now() to the timestamp,
> > > and if the timedelta is more than 1 second, stops running tasks and
> > > finishes the request. It already appears your project would be
> > > perfect for running all necessary tasks at once, and I believe the
> > > MIT License is compatible with the BSD license I've released
> > > gaeutilities under, so would you have any personal objection to me
> > > including it in gaeutilities at some point, with proper attribution
> > > of course?
>
> > Sorry I missed this in the first reply - yeah work away! :)
>
> > David
>
> > > If you haven't seen that project, its URL is
> > > http://gaeutilities.appspot.com/
>
> > > On Mar 16, 11:03 am, David Wilson  wrote:
> > >> Joe,
>
> > >> I've only tested it in production. ;)
>
> > >> The code should work serially on the SDK, but I haven't tried yet.
>
> > >> David.
>
> > >> 2009/3/16 Joe Bowman :
>
> > >> > Does the batch fetching work on live appengine applications, or
> > >> > only on the SDK?
>
> > >> > On Mar 16, 10:19 am, David Wilson  wrote:
> > >> >> I have no idea how definitive this is, but it literally means
> > >> >> that wall-clock time seems to be how CPU cost is measured. I
> > >> >> guess this makes sense for a few different reasons.
>
> > >> >> I found some internal function
> > >> >> "google3.apphosting.runtime._apphosting_runtime___python__apiproxy.get_requ
> > >> >>  est_cpu_usage"
> > >> >> with the docstring:
>
> > >> >>     Returns the number of megacycles used so far by this request.
> > >> >>     Does not include CPU used by API calls.
>
> > >> >> Calling it, then running time.sleep(5), then calling it again,
> > >> >> indicates thousands of megacycles used, yet in real terms the CPU was
> > >> >> probably doing nothing. I guess Datastore CPU, etc., is added on top
> > >> >> of this, but it seems to suggest to me that if you can drastically
> > >> >> reduce request time, quota usage should drop too.
>
> > >> >> I have yet to do any kind of rough measurements of Datastore CPU, so
> > >> >> I'm not sure how correct this all is.
>
> > >> >> David.
>
> > >> >>  - One of the guys on IRC suggested this means that per-request cost
> > >> >> is scaled during peak usage (and thus internal services running
> > >> >> slower).
>
> > >> >> 2009/3/16 peterk :
>
> > >> >> > A couple of questions re. CPU usage..
>
> > >> >> > "CPU time quota 

[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-18 Thread David Wilson

Hey Joe,

With the gdata package you can do something like this instead:


As usual, completely untested code, but looks about right..


import logging

import megafetch
from gdata.youtube import YouTubeVideoFeedFromString


def get_feeds_async(usernames):
    fetcher = megafetch.Fetcher()
    output = {}

    def cb(username, result):
        # result is either an Exception or a urlfetch response object.
        if isinstance(result, Exception):
            logging.error('could not fetch: %s', result)
            content = None
        else:
            content = YouTubeVideoFeedFromString(result.content)
        output[username] = content

    for username in usernames:
        url = 'http://gdata.youtube.com/feeds/api/users/%s/uploads' %\
            (username,)
        fetcher.start(url,
                      lambda result, username=username: cb(username, result))

    fetcher.wait()
    return output


feeds = get_feeds_async(['davemw', 'waverlyflams', 'googletechtalks',
                         'TheOnion', 'winterelaxation'])

# feeds is now a mapping of usernames to YouTubeVideoFeed instances,
# or None if the feed could not be fetched.
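
(Note the username=username default argument on the lambda: without it,
every callback would share the loop variable and end up seeing only the
last username from the loop.)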


2009/3/18 Joe Bowman :
>
> This may be a really dumb question, but.. I'm still learning so...
>
> Is there a way to do something other than a direct api call
> asynchronously? I'm writing a script that pulls from multiple sources,
> sometimes with higher level calls that use urlfetch, such as gdata.
> Since I'm attempting to pull from multiple sources, and sometimes
> multiple urls from each source, I'm trying to figure out if it's
> possible to run other methods at the same time.
>
> For example, I want to pull a youtube entry for several different
> authors. The youtube api doesn't allow multiple authors in a request
> (I have an enhancement request in for that though), so I need to do a
> yt_service.GetYouTubeVideoFeed() for each author, then splice them
> together into one feed. As I'm also working with Boss, and eventually
> Twitter, I'll have feeds to pull from those sources as well.
>
> My current application layout is using appengine-patch to provide
> django. I've set up a Boss and Youtube "model" with get methods that
> handle getting the data. So I can do something similar to:
>
> web_results = models.Boss.get(request.GET['term'], start=start)
> news_results = models.Boss.get(request.GET['term'], vertical="news",
> start=start)
> youtube = models.Youtube.get(request.GET['term'], start=start)
>
> Ideally, I'd like some of those models to be able to do asynchronous
> tasks within their get function, and then also, I'd like to run the
> above requests at the same time, which should really speed the request up.
>
>
> On Mar 17, 9:20 am, Joe Bowman  wrote:
>> Thanks,
>>
>> I'm going to give it a go for urlfetch calls for one project I'm
>> working on this week.
>>
>> Not sure when I'd be able to include it in gaeutilities for cron and
>> such; that project is currently lower on my priority list, but I can't
>> wait until I get a chance to play with it. Another
>> idea I had for it is the ROTmodel (retry on timeout model) in the
>> project, which could speed that process up.
>>
>> On Mar 17, 9:11 am, David Wilson  wrote:
>>
>> > 2009/3/16 Joe Bowman :
>>
>> > > Wow that's great. The SDK might be problematic for you, as it appears
>> > > to be very single threaded, I know for a fact it can't reply to
>> > > requests to itself.
>>
>> > > Out of curiosity, are you still using base urlfetch, or is it your own
>> > > creation? While it will be less of an issue once Google releases
>> > > their scheduled tasks functionality, if your solution had the
>> > > ability to fire off urlfetch calls and not wait for a response, it
>> > > could be a perfect fit for the gaeutilities cron utility.
>>
>> > > Currently it grabs a list of tasks it's supposed to run on request,
>> > > sets a timestamp, runs one, then compares now() to the timestamp,
>> > > and if the timedelta is more than 1 second, stops running tasks and
>> > > finishes the request. It already appears your project would be
>> > > perfect for running all necessary tasks at once, and I believe the
>> > > MIT License is compatible with the BSD license I've released
>> > > gaeutilities under, so would you have any personal objection to me
>> > > including it in gaeutilities at some point, with proper attribution
>> > > of course?
>>
>> > Sorry I missed this in the first reply - yeah work away! :)
>>
>> > David
>>
>> > > If you haven't seen that project, its URL is
>> > > http://gaeutilities.appspot.com/
>>
>> > > On Mar 16, 11:03 am, David Wilson  wrote:
>> > >> Joe,
>>
>> > >> I've only tested it in production. ;)
>>
>> > >> The code should work serially on the SDK, but I haven't tried yet.
>>
>> > >> David.
>>
>> > >> 2009/3/16 Joe Bowman :
>>
>> > >> > Does the batch fetching work on live appengine applications, or
>> > >> > only on the SDK?
>>
>> > >> > On Mar 16, 10:19 am, David Wilson  wrote:
>> > >> >> I have no idea how definitive this is, but it literally means that
>> > >> >> wall-clock time seems to be how CPU cost is measured. I guess this makes
>> > >> >> sense for a fe

[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-18 Thread Joe Bowman

Ah ha.. thanks David.

And for the views, if I really wanted to launch everything at once, I
could map my boss, youtube, twitter, etc. pulls to their own URLs, and
use megafetch in my master view to pull those URLs all at once
too.
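
Roughly what I have in mind - completely untested, the handler paths
and host are made up, and since the SDK can't reply to requests to
itself, this would only work in production:

import logging
from urllib import quote

import megafetch


def master_view(request):
    term = quote(request.GET['term'])
    base = 'http://myapp.appspot.com'  # hypothetical app host
    urls = {
        'web': '%s/pull/boss?term=%s' % (base, term),
        'news': '%s/pull/boss?term=%s&vertical=news' % (base, term),
        'youtube': '%s/pull/youtube?term=%s' % (base, term),
    }
    fetcher = megafetch.Fetcher()
    results = {}

    def cb(name, result):
        if isinstance(result, Exception):
            logging.error('%s pull failed: %s', name, result)
            results[name] = None
        else:
            results[name] = result.content

    for name, url in urls.items():
        # Default argument pins each source name to its own callback.
        fetcher.start(url, lambda result, name=name: cb(name, result))

    fetcher.wait()
    # results now maps each source name to its raw response body,
    # or None where a pull failed.
    return results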

On Mar 18, 5:14 am, David Wilson  wrote:
> Hey Joe,
>
> With the gdata package you can do something like this instead:
>
> As usual, completely untested code, but looks about right..
>
> import logging
>
> import megafetch
> from gdata.youtube import YouTubeVideoFeedFromString
>
> def get_feeds_async(usernames):
>     fetcher = megafetch.Fetcher()
>     output = {}
>
>     def cb(username, result):
>         if isinstance(result, Exception):
>             logging.error('could not fetch: %s', result)
>             content = None
>         else:
>             content = YouTubeVideoFeedFromString(result.content)
>         output[username] = content
>
>     for username in usernames:
>         url = 'http://gdata.youtube.com/feeds/api/users/%s/uploads' %\
>             (username,)
>         fetcher.start(url,
>                       lambda result, username=username: cb(username, result))
>
>     fetcher.wait()
>     return output
>
> feeds = get_feeds_async([ 'davemw', 'waverlyflams', 'googletechtalks',
>                           'TheOnion', 'winterelaxation' ])
>
> # feeds is now a mapping of usernames to YouTubeVideoFeed instances,
> or None if the feed could not be fetched.
>
> 2009/3/18 Joe Bowman :
>
>
>
> > This may be a really dumb question, but.. I'm still learning so...
>
> > Is there a way to do something other than a direct api call
> > asynchronously? I'm writing a script that pulls from multiple sources,
> > sometimes with higher level calls that use urlfetch, such as gdata.
> > Since I'm attempting to pull from multiple sources, and sometimes
> > multiple urls from each source, I'm trying to figure out if it's
> > possible to run other methods at the same time.
>
> > For example, I want to pull a youtube entry for several different
> > authors. The youtube api doesn't allow multiple authors in a request
> > (I have an enhancement request in for that though), so I need to do a
> > yt_service.GetYouTubeVideoFeed() for each author, then splice them
> > together into one feed. As I'm also working with Boss, and eventually
> > Twitter, I'll have feeds to pull from those sources as well.
>
> > My current application layout is using appengine-patch to provide
> > django. I've set up a Boss and Youtube "model" with get methods that
> > handle getting the data. So I can do something similar to:
>
> > web_results = models.Boss.get(request.GET['term'], start=start)
> > news_results = models.Boss.get(request.GET['term'], vertical="news",
> > start=start)
> > youtube = models.Youtube.get(request.GET['term'], start=start)
>
> > Ideally, I'd like some of those models to be able to do asynchronous
> > tasks within their get function, and then also, I'd like to run the
> > above requests at the same time, which should really speed the request up.
>
> > On Mar 17, 9:20 am, Joe Bowman  wrote:
> >> Thanks,
>
> >> I'm going to give it a go for urlfetch calls for one project I'm
> >> working on this week.
>
> >> Not sure when I'd be able to include it in gaeutilities for cron and
> >> such; that project is currently lower on my priority list, but I can't
> >> wait until I get a chance to play with it. Another
> >> idea I had for it is the ROTmodel (retry on timeout model) in the
> >> project, which could speed that process up.
>
> >> On Mar 17, 9:11 am, David Wilson  wrote:
>
> >> > 2009/3/16 Joe Bowman :
>
> >> > > Wow that's great. The SDK might be problematic for you, as it appears
> >> > > to be very single threaded, I know for a fact it can't reply to
> >> > > requests to itself.
>
> >> > > Out of curiosity, are you still using base urlfetch, or is it your own
> >> > > creation? While it will be less of an issue once Google releases
> >> > > their scheduled tasks functionality, if your solution had the
> >> > > ability to fire off urlfetch calls and not wait for a response, it
> >> > > could be a perfect fit for the gaeutilities cron utility.
>
> >> > > Currently it grabs a list of tasks it's supposed to run on request,
> >> > > sets a timestamp, runs one, then compares now() to the timestamp,
> >> > > and if the timedelta is more than 1 second, stops running tasks and
> >> > > finishes the request. It already appears your project would be
> >> > > perfect for running all necessary tasks at once, and I believe the
> >> > > MIT License is compatible with the BSD license I've released
> >> > > gaeutilities under, so would you have any personal objection to me
> >> > > including it in gaeutilities at some point, with proper attribution
> >> > > of course?
>
> >> > Sorry I missed this in the first reply - yeah work away! :)
>
> >> > David
>
> >> > > If you haven't seen that project, its URL is
> >> > > http://gaeutilities.appspot.com/
>
> >> > > On Mar 16, 11:03 am, David Wilson  wrote:
> >> > >> Joe,
>
> >> > >> I've only tested it in production. ;)
>
> >> > >> The code should work serial

[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-18 Thread bFlood

hey david,joe

I've got the async datastore Get working, but I'm not sure the
callbacks are being run on a background thread. They appear to be when
you examine something like thread-local storage (the hashes are all
unique), but then if you insert just a simple time.sleep they appear to
run serially. (Note - while I'm not completely new to async code, this
is my first run with Python, so I'm not sure of the threading behaviour
of something like sleep or logging.debug.)
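
Roughly the kind of probe I mean - a simplified sketch of just the
callback, which could be handed to fetcher.start() or to the async Get:

import logging
import threading
import time


def make_probe(label):
    # Returns a callback that logs which thread ran it, then stalls.
    # time.sleep releases the GIL, so if callbacks really run on
    # separate threads the stalls overlap; if they share one thread,
    # the stalls add up and everything looks serial.
    def cb(result):
        start = time.time()
        logging.debug('%s running on %s', label,
                      threading.currentThread().getName())
        time.sleep(1)
        logging.debug('%s held its thread for %.2fs', label,
                      time.time() - start)
    return cb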

I would like to be able to run some code just after the fetch for each
entity; the hope is that this would run in parallel.

any thoughts?

cheers
brian

On Mar 18, 6:14 am, Joe Bowman  wrote:
> Ah ha.. thanks David.
>
> And for the views, if I really wanted to launch everything at once, I
> could map my boss, youtube, twitter, etc. pulls to their own URLs,
> and use megafetch in my master view to pull those URLs all at once
> too.
>
> On Mar 18, 5:14 am, David Wilson  wrote:
>
> > Hey Joe,
>
> > With the gdata package you can do something like this instead:
>
> > As usual, completely untested code, but looks about right..
>
> > import logging
>
> > import megafetch
> > from gdata.youtube import YouTubeVideoFeedFromString
>
> > def get_feeds_async(usernames):
> >     fetcher = megafetch.Fetcher()
> >     output = {}
>
> >     def cb(username, result):
> >         if isinstance(result, Exception):
> >             logging.error('could not fetch: %s', result)
> >             content = None
> >         else:
> >             content = YouTubeVideoFeedFromString(result.content)
> >         output[username] = content
>
> >     for username in usernames:
> >         url = 'http://gdata.youtube.com/feeds/api/users/%s/uploads' %\
> >             (username,)
> >         fetcher.start(url,
> >                       lambda result, username=username: cb(username, result))
>
> >     fetcher.wait()
> >     return output
>
> > feeds = get_feeds_async([ 'davemw', 'waverlyflams', 'googletechtalks',
> >                           'TheOnion', 'winterelaxation' ])
>
> > # feeds is now a mapping of usernames to YouTubeVideoFeed instances,
> > or None if the feed could not be fetched.
>
> > 2009/3/18 Joe Bowman :
>
> > > This may be a really dumb question, but.. I'm still learning so...
>
> > > Is there a way to do something other than a direct api call
> > > asynchronously? I'm writing a script that pulls from multiple sources,
> > > sometimes with higher level calls that use urlfetch, such as gdata.
> > > Since I'm attempting to pull from multiple sources, and sometimes
> > > multiple urls from each source, I'm trying to figure out if it's
> > > possible to run other methods at the same time.
>
> > > For example, I want to pull a youtube entry for several different
> > > authors. The youtube api doesn't allow multiple authors in a request
> > > (I have an enhancement request in for that though), so I need to do a
> > > yt_service.GetYouTubeVideoFeed() for each author, then splice them
> > > together into one feed. As I'm also working with Boss, and eventually
> > > Twitter, I'll have feeds to pull from those sources as well.
>
> > > My current application layout is using appengine-patch to provide
> > > django. I've set up a Boss and Youtube "model" with get methods that
> > > handle getting the data. So I can do something similar to:
>
> > > web_results = models.Boss.get(request.GET['term'], start=start)
> > > news_results = models.Boss.get(request.GET['term'], vertical="news",
> > > start=start)
> > > youtube = models.Youtube.get(request.GET['term'], start=start)
>
> > > Ideally, I'd like some of those models to be able to do asynchronous
> > > tasks within their get function, and then also, I'd like to run the
> > > above requests at the same time, which should really speed the request up.
>
> > > On Mar 17, 9:20 am, Joe Bowman  wrote:
> > >> Thanks,
>
> > >> I'm going to give it a go for urlfetch calls for one project I'm
> > >> working on this week.
>
> > >> Not sure when I'd be able to include it in gaeutilities for cron and
> > >> such; that project is currently lower on my priority list, but I can't
> > >> wait until I get a chance to play with it. Another
> > >> idea I had for it is the ROTmodel (retry on timeout model) in the
> > >> project, which could speed that process up.
>
> > >> On Mar 17, 9:11 am, David Wilson  wrote:
>
> > >> > 2009/3/16 Joe Bowman :
>
> > >> > > Wow that's great. The SDK might be problematic for you, as it appears
> > >> > > to be very single threaded, I know for a fact it can't reply to
> > >> > > requests to itself.
>
> > >> > > Out of curiosity, are you still using base urlfetch, or is it your 
> > >> > > own
> > >> > > creation? While it will be less of an issue once Google releases
> > >> > > their scheduled tasks functionality, if your solution had the
> > >> > > ability to fire off urlfetch calls and not wait for a response, it
> > >> > > could be a perfect fit for the gaeutilities cron utility.
>
> > >> > > Currently it grabs a list of tasks it's supposed to run on request,
> > >> > > sets a timestamp, runs one, t

[google-appengine] Re: Parallel urlfetch utility class / function.

2009-03-18 Thread Joe Bowman

Well, you'll never get truly parallel running of the callbacks, given
that even if they're running in the same thread as the urlfetch, each
fetch will take a different amount of time. Though, I'm not sure
whether the callbacks run in the core thread or not - that's where
they'd be running if you see them executing sequentially. I don't have
access to a machine to look at this until tonight, and I'm not sure
even then I'd have the time. However, if I were to look at it, I'd
probably try these two approaches...

Since you're checking thread hashes, you could check the hash of the
thread the urlfetch uses, and see if the callback thread hash matches.

You could also do something like:

Get a urlfetch-start-timestamp
Do the urlfetch
Get a urlfetch-complete-timestamp
Get a callback-start-timestamp

Compare the urlfetch-start-timestamps to confirm they're all starting
at the same time.
Compare the urlfetch-complete-timestamps to the callback-start-
timestamps to see if the callback is indeed starting at the end of the
fetch.
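
Something like this, roughly - untested, and note that with the
Fetcher interface the callback is the first place we can observe
completion, so it stands in for the urlfetch-complete-timestamp:

import logging
import time

import megafetch


def timing_probe(urls):
    fetcher = megafetch.Fetcher()
    timings = {}

    def cb(url, result):
        timings[url]['callback-start'] = time.time()

    overall_start = time.time()
    for url in urls:
        timings[url] = {'fetch-start': time.time()}
        fetcher.start(url, lambda result, url=url: cb(url, result))

    fetcher.wait()

    # fetch-start values within a few ms of overall_start confirm the
    # fetches were launched together; the callback-start offsets show
    # when each fetch actually completed and its callback began.
    for url, stamps in timings.items():
        logging.info('%s: started +%.3fs, callback +%.3fs', url,
                     stamps['fetch-start'] - overall_start,
                     stamps['callback-start'] - overall_start)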

On Mar 18, 8:11 am, bFlood  wrote:
> hey david,joe
>
> I've got the async datastore Get working, but I'm not sure the
> callbacks are being run on a background thread. They appear to be when
> you examine something like thread-local storage (the hashes are all
> unique), but then if you insert just a simple time.sleep they appear to
> run serially. (Note - while I'm not completely new to async code, this
> is my first run with Python, so I'm not sure of the threading behaviour
> of something like sleep or logging.debug.)
>
> I would like to be able to run some code just after the fetch for each
> entity; the hope is that this would run in parallel.
>
> any thoughts?
>
> cheers
> brian
>
> On Mar 18, 6:14 am, Joe Bowman  wrote:
>
> > Ah ha.. thanks David.
>
> > And for the views, if I really wanted to launch everything at once, I
> > could map my boss, youtube, twitter, etc. pulls to their own URLs,
> > and use megafetch in my master view to pull those URLs all at once
> > too.
>
> > On Mar 18, 5:14 am, David Wilson  wrote:
>
> > > Hey Joe,
>
> > > With the gdata package you can do something like this instead:
>
> > > As usual, completely untested code, but looks about right..
>
> > > import logging
>
> > > import megafetch
> > > from gdata.youtube import YouTubeVideoFeedFromString
>
> > > def get_feeds_async(usernames):
> > >     fetcher = megafetch.Fetcher()
> > >     output = {}
>
> > >     def cb(username, result):
> > >         if isinstance(result, Exception):
> > >             logging.error('could not fetch: %s', result)
> > >             content = None
> > >         else:
> > >             content = YouTubeVideoFeedFromString(result.content)
> > >         output[username] = content
>
> > >     for username in usernames:
> > >         url = 'http://gdata.youtube.com/feeds/api/users/%s/uploads' %\
> > >             (username,)
> > >         fetcher.start(url,
> > >                       lambda result, username=username: cb(username, result))
>
> > >     fetcher.wait()
> > >     return output
>
> > > feeds = get_feeds_async([ 'davemw', 'waverlyflams', 'googletechtalks',
> > >                           'TheOnion', 'winterelaxation' ])
>
> > > # feeds is now a mapping of usernames to YouTubeVideoFeed instances,
> > > or None if the feed could not be fetched.
>
> > > 2009/3/18 Joe Bowman :
>
> > > > This may be a really dumb question, but.. I'm still learning so...
>
> > > > Is there a way to do something other than a direct api call
> > > > asynchronously? I'm writing a script that pulls from multiple sources,
> > > > sometimes with higher level calls that use urlfetch, such as gdata.
> > > > Since I'm attempting to pull from multiple sources, and sometimes
> > > > multiple urls from each source, I'm trying to figure out if it's
> > > > possible to run other methods at the same time.
>
> > > > For example, I want to pull a youtube entry for several different
> > > > authors. The youtube api doesn't allow multiple authors in a request
> > > > (I have an enhancement request in for that though), so I need to do a
> > > > yt_service.GetYouTubeVideoFeed() for each author, then splice them
> > > > together into one feed. As I'm also working with Boss, and eventually
> > > > Twitter, I'll have feeds to pull from those sources as well.
>
> > > > My current application layout is using appengine-patch to provide
> > > > django. I've set up a Boss and Youtube "model" with get methods that
> > > > handle getting the data. So I can do something similar to:
>
> > > > web_results = models.Boss.get(request.GET['term'], start=start)
> > > > news_results = models.Boss.get(request.GET['term'], vertical="news",
> > > > start=start)
> > > > youtube = models.Youtube.get(request.GET['term'], start=start)
>
> > > > Ideally, I'd like some of those models to be able to do asynchronous
> > > > tasks within their get function, and then also, I'd like to run the
> > > > above requests at the same time, which should really speed the request up.
>
> > > > On Mar 17, 9:20 am, Joe Bowman  wrote:
> > > >> Thanks,
>
> > > >> I'm