Hi,

In case anyone is interested in running a crawler on Google App Engine (GAE), note the following:

- GAE doesn't support threads. So the task master of a GAE Droids will have to work in a significantly different way. There won't be any single-threaded or multi-threaded task master, as GAE doesn't support any long-running process.
- So any unit of work, such as polling a link from the queue, fetching, parsing, etc., has to be triggered by an HTTP request. The free version may be limited to 20 cron jobs a day. Without an external trigger, one probably wants a "chained HTTP request" that lets a cron job call a URL that in turn calls multiple URLs asynchronously to trigger multiple jobs. The whole design has to take the general quota limits into account for optimization, e.g. max HTTP fetches per minute, bandwidth, etc.
- Every HTTP request may last for as long as 30s, and is subject to other limitations, such as the stack in the heap can't be larger than 1 MB, only HTTP and HTTPS are supported, etc. So if a page takes longer than 30s to download, there is no chance to process that page. For parsing and extraction, I suppose it could be done within 30s.
- GAE doesn't support sockets. So Apache HttpClient cannot be used (unless anyone knows how to override just the connection code). But it's OK to depend on HttpEntity, as we could construct an HttpEntity from an InputStream or a byte[].

I think it would be good for us to separate any thread and socket/HttpClient related code and allow people to extend our classes to run a Droid in GAE.

Regards,
mingfai
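
To make the "no thread, triggered by HTTP request" idea above more concrete, here is a rough sketch of what a request-driven task master could look like. It is only an illustration under assumptions: `LinkWorker` and `RequestDrivenTaskMaster` are hypothetical names, not existing Droids classes, and the in-memory queue stands in for whatever persistent queue a real GAE deployment would need between requests. Each call to `runBatch` represents the work done inside one HTTP request and stops once its time budget is used up.

```java
import java.util.List;
import java.util.Queue;
import java.util.function.LongSupplier;

/** Hypothetical hook: fetch and parse one link, return newly discovered links. */
interface LinkWorker {
    List<String> process(String link) throws Exception;
}

/** Hypothetical thread-free task master: one call handles one HTTP-triggered batch. */
class RequestDrivenTaskMaster {
    private final Queue<String> queue;   // assumption: stands in for a persistent queue
    private final LinkWorker worker;
    private final LongSupplier clock;    // injectable so the time budget can be tested

    RequestDrivenTaskMaster(Queue<String> queue, LinkWorker worker, LongSupplier clock) {
        this.queue = queue;
        this.worker = worker;
        this.clock = clock;
    }

    /**
     * Processes queued links until the queue is empty or the time budget
     * (in milliseconds) has elapsed. Returns the number of links handled.
     */
    int runBatch(long budgetMillis) {
        long deadline = clock.getAsLong() + budgetMillis;
        int processed = 0;
        while (clock.getAsLong() < deadline) {
            String link = queue.poll();
            if (link == null) {
                break; // nothing left to do in this request
            }
            try {
                queue.addAll(worker.process(link));
            } catch (Exception e) {
                // In this sketch a failed link is simply dropped;
                // a real crawler would probably record or retry it.
            }
            processed++;
        }
        return processed;
    }
}
```

A request handler hit by a cron job (or by a chained HTTP request) would then call something like `runBatch(25000)` with `System::currentTimeMillis` as the clock, leaving some margin under the 30s limit. Note that the budget is only checked between links, so a single slow fetch can still overrun it.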
