Notes for running Droids in Google App Engine

Mingfai Thu, 18 Jun 2009 04:14:35 -0700

hi,

In case anyone is interested in running a crawler on Google App Engine, note
the following:
 - GAE doesn't support threads. So the task master of a GAE Droid will work
in a significantly different way. There won't be a single-threaded or
multi-threaded task master, as GAE doesn't support any long-running process.


 - So any unit of work, such as polling a link from the queue, fetching,
parsing, etc., has to be triggered by an HTTP request. The free version may be
limited to 20 cron jobs a day. Without an external trigger, one probably wants
a "chained HTTP request" design, where a cron job calls a URL that in turn
calls multiple URLs asynchronously to trigger multiple jobs. The whole design
has to take the general quota limits into account for optimization, e.g. max
HTTP fetches per minute, bandwidth, etc.

 - Every HTTP request may last for as long as 30s, and is subject to other
limitations: for example, the stack in heap can't be larger than 1M, and only
HTTP and HTTPS are supported. So, if a page takes longer than 30s to download,
there is no chance to process that page. For parsing and extraction, I suppose
it could be done within 30s.

 - GAE doesn't support sockets, so Apache HttpClient cannot be used (unless
anyone knows how to override just the connection code). But it's OK to depend
on HttpEntity, as we could construct an HttpEntity from an InputStream or a
byte[].
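
To make the request-driven model in the notes above a bit more concrete, here
is a rough sketch of what one "unit of work per HTTP request" could look like.
It is only an illustration under my own assumptions: the class name
RequestBudgetWorker, the handler callback and the injectable clock are all
hypothetical and are not part of Droids or of any GAE API. The idea is that
each incoming request drains as many links as fit into a time budget kept
well below 30s, and leaves the rest in the queue for the next request.

```java
import java.util.Queue;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch: one HTTP request processes as many queued links
 * as its time budget allows, then returns. Not an existing Droids class.
 */
final class RequestBudgetWorker {
    private final Queue<String> links;       // pending links (in-memory here for illustration)
    private final Consumer<String> handler;  // fetch + parse for a single link (assumed)
    private final LongSupplier clockMillis;  // injectable clock, e.g. System::currentTimeMillis

    RequestBudgetWorker(Queue<String> links, Consumer<String> handler, LongSupplier clockMillis) {
        this.links = links;
        this.handler = handler;
        this.clockMillis = clockMillis;
    }

    /**
     * Handles links until the queue is empty or the budget is spent.
     * A link that has already started is not interrupted, so the budget
     * should leave headroom below the platform's request deadline.
     *
     * @return the number of links handled during this call
     */
    int runOnce(long budgetMillis) {
        long deadline = clockMillis.getAsLong() + budgetMillis;
        int handled = 0;
        while (clockMillis.getAsLong() < deadline) {
            String link = links.poll();
            if (link == null) {
                break; // nothing left to do in this request
            }
            handler.accept(link);
            handled++;
        }
        return handled;
    }
}
```

A request handler (or a URL hit by a cron job) would call runOnce(...) once
and return. Whatever is still queued is picked up by the next triggered
request, which is where the "chained HTTP request" idea comes in.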

I think it would be good for us to separate out any thread-related and
socket/HttpClient-related code, and allow people to extend our classes to run
a Droid in GAE.
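
As a rough idea of what such a separation might look like, here is a small
sketch. The names PageFetcher, UrlConnectionFetcher and MapFetcher are made up
for illustration and do not exist in Droids. The crawler core would depend only
on the interface; one implementation uses plain java.net classes (I am assuming
these are usable on GAE, which I have not verified here), and an in-memory one
is handy for tests. The returned byte[] could then be wrapped in an HttpEntity
by the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical seam: the crawler core sees only this, never sockets or HttpClient. */
interface PageFetcher {
    byte[] fetch(String url) throws IOException;
}

/** Fetches with java.net only; assumed (not verified) to be usable on GAE. */
final class UrlConnectionFetcher implements PageFetcher {
    private final int timeoutMillis;

    UrlConnectionFetcher(int timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public byte[] fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try (InputStream in = conn.getInputStream()) {
            // Read the whole body into memory; fine for a sketch, not for huge pages.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            conn.disconnect();
        }
    }
}

/** In-memory stand-in, useful for testing the crawler core without a network. */
final class MapFetcher implements PageFetcher {
    private final Map<String, byte[]> pages = new HashMap<>();

    MapFetcher with(String url, String body) {
        pages.put(url, body.getBytes(StandardCharsets.UTF_8));
        return this;
    }

    @Override
    public byte[] fetch(String url) throws IOException {
        byte[] body = pages.get(url);
        if (body == null) {
            throw new IOException("no such page: " + url);
        }
        return body;
    }
}
```

With a seam like this, the thread-based task master and the HttpClient-based
fetcher would simply be the default implementations, and a GAE port could swap
in its own without touching the parsing and extraction code.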

regards,
mingfai
