On Thu, Jun 18, 2009 at 7:14 PM, Mingfai <[email protected]> wrote:

> hi,
>
> In case anyone is interested in running a crawler on Google App Engine,
> note the following:
>  - GAE doesn't support threads, so the task master of a GAE Droid will work
> in a significantly different way. There won't be a single-threaded or
> multi-threaded task master, as GAE doesn't support any long-running process.
>
>  - So any unit of work, such as polling a link from the queue, fetching,
> parsing, etc., has to be triggered by an HTTP request. The free version may
> allow 20 cron jobs a day. Without an external trigger, one probably wants a
> "chained HTTP request" that lets a cron job call a URL that in turn calls
> multiple URLs asynchronously to trigger multiple jobs. The whole design has
> to take the general quota limits into account, e.g. max HTTP fetches per
> minute, bandwidth, etc.
>
>  - Every HTTP request may last for up to 30s and is subject to other
> limits, e.g. the stack in heap can't be larger than 1M, only HTTP and HTTPS
> are supported, etc. So, if a page takes longer than 30s to download, there
> is no chance to process that page. For parsing and extraction, I suppose it
> could be done within 30s.
>
>  - GAE doesn't support sockets, so Apache HttpClient cannot be used (unless
> anyone knows how to override just the connection code). It's OK to depend
> on HttpEntity, though, as we could construct an HttpEntity from an
> InputStream or a byte[].
>
> I think it would be good for us to separate out any thread- and
> socket/HttpClient-related code and allow people to extend our classes to
> run a Droid in GAE.
>
> regards,
> mingfai
>



My GAE experiments turned up a meaningful use case (and an implementation).
Especially for a "processing" crawler, where I presume the bottleneck is the
CPU and RAM of your server farm, we could use Google's processing power
and connectivity to do content extraction from a page, or image processing.
It works like this:

   - We deploy an app to GAE that:
      - exposes an HTTP remoting interface (something like RMI over HTTP),
        e.g.
           public Serializable extract( String url );
           public Set<Link> extract( LinkTask link, Parser parser );
        Any argument and return value has to be serializable, of course.

      - does all the processing-intensive work behind that interface:
        fetching the page, parsing it (usually the most CPU-consuming step),
        and extracting elements from the page. It then returns the extracted
        elements.

      - For me, I use Spring's HTTP Invoker. It takes just a few lines of
        configuration to make a normal Java service remotely accessible.

      - As an example, it takes 400ms in GAE to fetch the apache.org
        homepage, parse it with NekoHtmlParser and extract all outlinks.

   - In our main crawler/Droids:
      - Instead of sending an HTTP request to the actual destination, the
        worker calls the remoting API in our GAE app, passes in the URL and
        gets back the extracted data.

      - The main task/link queue is still maintained in our local app. This
        is something that cannot be done in GAE until it supports some kind
        of Task Queue (a beta is already available for its Python
        environment).

      - With Spring, we could switch from a local invocation to a remote
        HTTP invocation just by configuration.
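
A rough sketch of the split described above, in plain Java so it runs without
Spring or GAE. All names here (ExtractionService, ExtractionResult,
LocalExtractionService, crawl) are hypothetical and not part of Droids, and
the regex is only a stand-in for a real parser such as NekoHTML. The point is
that the crawler depends only on the interface, so the in-process
implementation could be swapped for a remote proxy (with Spring, presumably
one created by HttpInvokerProxyFactoryBean) without touching the crawl loop.

```java
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RemoteExtractionSketch {

    /** Hypothetical remoting contract; everything crossing it must be Serializable. */
    public interface ExtractionService {
        ExtractionResult extract(String url);
    }

    /** Serializable return value carrying the out-links found on one page. */
    public static final class ExtractionResult implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String url;
        public final LinkedHashSet<String> links;

        public ExtractionResult(String url, LinkedHashSet<String> links) {
            this.url = url;
            this.links = links;
        }
    }

    /** In-process implementation; the same class could be deployed to GAE and exported remotely. */
    public static final class LocalExtractionService implements ExtractionService {
        // Crude href matcher standing in for a real HTML parser (assumption for brevity).
        private static final Pattern HREF =
                Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

        private final Function<String, String> fetcher; // maps a URL to the page body

        public LocalExtractionService(Function<String, String> fetcher) {
            this.fetcher = fetcher;
        }

        @Override
        public ExtractionResult extract(String url) {
            String html = fetcher.apply(url);
            LinkedHashSet<String> links = new LinkedHashSet<>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1));
            }
            return new ExtractionResult(url, links);
        }
    }

    /** The crawler keeps the link queue locally and delegates fetch/parse/extract. */
    public static Set<String> crawl(ExtractionService service, String seed, int maxPages) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> visited = new LinkedHashSet<>();
        queue.add(seed);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already processed
            }
            queue.addAll(service.extract(url).links);
        }
        return visited;
    }

    public static void main(String[] args) {
        // Fake fetcher so the sketch runs without network access.
        Function<String, String> fake = url -> url.equals("http://example.org/")
                ? "<a href=\"http://example.org/a\">a</a> <a href='http://example.org/b'>b</a>"
                : "";
        System.out.println(crawl(new LocalExtractionService(fake), "http://example.org/", 10));
    }
}
```

The fetcher is injected as a function so that the extraction logic can be
exercised without any network or HttpClient dependency.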


It's quite interesting and useful to me. Every Google Account (with a unique
mobile phone number) can have 10 apps, and each app may handle 3k calls per
min (continuously for around 3.5 hrs before the daily quota is used up).
Besides, there are other advantages:

   - The requests come from Google's IPs, so websites are less likely to
   block your crawler.
   - Faster connectivity means better throughput. The crawler won't need to
   spawn a lot of waiting threads to utilize the bandwidth.
   - Unlimited scalability.
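
To put the throughput figures above into numbers, here is the back-of-envelope
arithmetic as a small Java sketch. The inputs (3k calls per minute, about 3.5
hours, 10 apps) are the ones quoted above; the per-day totals are simply
derived from them and are not official quota values.

```java
public class QuotaEstimate {

    /** Calls one app can serve at a sustained rate before its daily quota runs out. */
    static long callsPerDay(long callsPerMinute, double hoursUntilQuotaExhausted) {
        return Math.round(callsPerMinute * 60 * hoursUntilQuotaExhausted);
    }

    public static void main(String[] args) {
        long perApp = callsPerDay(3_000, 3.5);   // 3000 * 60 * 3.5
        long perAccount = perApp * 10;           // 10 apps per Google Account
        System.out.println(perApp + " calls per app per day");         // 630000
        System.out.println(perAccount + " calls per account per day"); // 6300000
    }
}
```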

To utilize GAE, Droids has to be refactored so that the
fetch-parse-extraction code sits behind an isolated interface, without
using HttpClient. It's much easier to use Spring. If anyone wants to try
that, I could create a DroidsWebService(s) module.
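
As an illustration of fetching without HttpClient, here is a minimal sketch
that uses only java.net.URLConnection, which as far as I know is the route GAE
offers for outbound HTTP. The class name UrlConnectionFetcher and the timeout
value are my own assumptions. The resulting byte[] could then be wrapped in an
HttpEntity, as suggested in the quoted mail.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class UrlConnectionFetcher {

    /** Reads the whole response body into memory using only java.net, no HttpClient. */
    public static byte[] fetch(URL url, int timeoutMillis) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try (InputStream in = conn.getInputStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: UrlConnectionFetcher <url>");
            return;
        }
        // 10s timeout is an arbitrary choice, well below the 30s request limit.
        byte[] body = fetch(new URL(args[0]), 10_000);
        System.out.println(body.length + " bytes");
    }
}
```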

Regards,
mingfai






[1] - http://daniel.gredler.net/2008/01/07/java-remoting-protocol-benchmarks/
[2] - http://static.springframework.org/spring/docs/3.0.x/spring-framework-reference/html/ch21s04.html
