larm overview.xml

otis Tue, 29 Oct 2002 20:11:39 -0800

otis        2002/10/29 20:09:04

  Added:       xdocs/lucene-sandbox/larm overview.xml
  Log:
  - LARM Technical Overview.
  
  Revision  Changes    Path
  1.1                  jakarta-lucene/xdocs/lucene-sandbox/larm/overview.xml
  
  Index: overview.xml
  ===================================================================
  <?xml version="1.0" encoding="ISO-8859-1"?>
  <document>
  
      <properties>
          <author email="cmarschner at apache dot org">Clemens Marschner</author>
          <title>LARM Webcrawler - Technical Overview</title>
      </properties>
  
      <meta name="keyword" content="jakarta, java, LARM, WebCrawler,
            Crawler, Fetcher"/>
  
      <body>
  
          <section name="The LARM Web Crawler - Technical Overview">
  
              <p align="center">Author: Clemens Marschner</p>
  
              <p align="center">Revised: Oct. 28, 2002</p>
  
              <p>
                  This document describes the configuration parameters and the inner
                  workings of the LARM web crawler.
              </p>
  
  
              <subsection name="Purpose and Intended Audience">
  
                  <p>
                      This document was made for Lucene developers, not necessarily 
with any
                      background knowledge on crawlers, to understand the inner 
workings of
                      the LARM crawler, the current problems and some directions for 
future
                      development. The aim is to keep the entry costs low for people 
who have
                      an interest in developing this piece of software further.
                  </p>
  
                  <p>
                      It may also serve for actual users of the Lucene engine, but 
beware,
                      since there is a lot that will change in the near future, 
especially the
                      configuration.
                  </p>
  
              </subsection>
  
              <subsection name="Quick Start">
  
                  <p>The crawler is only accessible via anonymous CVS at this time. See
                      the <a 
href="http://jakarta.apache.org/site/cvsindex.html";>Jakarta CVS
                          instructions</a> for details if you're not familiar with 
CVS.</p>
  
                  <p>Too long? The following will give you a quick start: create a new
                      directory, say, "jakarta", make it your current directory, and 
type</p>
  
                  <source><![CDATA[cvs -d 
:pserver:[EMAIL PROTECTED]:/home/cvspublic
                      login
                      password: anoncvs
                      cvs -d :pserver:[EMAIL PROTECTED]:/home/cvspublic checkout
                      jakarta-lucene-sandbox]]></source>
  
                  <p>
                      The crawler will then be in
                      jakarta-lucene-sandbox/contributions/webcrawler-LARM. To compile 
it you will also need
                  </p>
  
                  <ul>
  
                      <li>this <a
                                  
href="http://www.innovation.ch/java/HTTPClient/";>HTTPClient</a>. Put the .zip file 
into the libs/ directory</li>
  
                      <li>a working installation of <a
                                                       
href="http://jakarta.apache.org/ant";>ANT</a> (1.5 or above recommended). ant.sh/.bat 
should be in your
                          PATH</li>
  
                  </ul>
  
                  <p>
                      Change to the webcrawler-LARM directory and type
                  </p>
  
                  <source>ant</source>
  
                  <p>
                      You should then have a working copy of LARM in
                      build/webcrawler-LARM-0.5.jar. See the section <a 
href="#syntax">Syntax</a> below on how to
                      start the crawler.<br/>
                  </p>
  
              </subsection>
  
  
          </section>
  
          <section name="Introduction">
  
              <subsection name="Why web crawlers at all?">
  
                  <p>
                      Web crawlers became necessary because the web standard protocols 
didn't
                      contain any mechanisms to inform search engines that the data on 
a web
                      server had been changed. If this were possible, a search engine 
could
                      be notified in a "push" fashion, which would simplify the total 
process
                      and would make indexes as current as possible.
                  </p>
  
                  <p>
                      Imagine a web server that notifies another web server that a 
link was
                      created from one of its pages to the other server. That other 
server
                      could then send a message back if the page was removed.
                  </p>
  
                  <p>
                      On the other hand, handling this system would be a lot more
                      complicated. Keeping distributed information up to date is an 
erroneous task. Even
                      in a single relational database it is often complicated to 
define and
                      handle dependencies between relations. Should it be possible to 
allow
                      inconsistencies for a short period of time? Should dependent 
data be
                      deleted if a record is removed? Handling relationships between 
clusters of
                      information well incorporates a new level of complexity.
                  </p>
  
                  <p>
                      In order to keep the software (web servers and browsers) simple, 
the
                      inventors of the web concentrated on just a few core elements - 
URLs for
                      (more or less) uniquely identifying distributed information, 
HTTP for
                      handling the information, and HTML for structuring it. That 
system was
                      so simple that one could understand it in a very short time. 
This is
                      probably one of the main reasons why the WWW became so popular. 
Well,
                      another one would probably be coloured, moving graphics of naked 
people.
                  </p>
  
                  <p>
                      But the WWW has some major disadvantages: There is no single 
index of
                      all available pages. Information can change without notice. URLs 
can
                      point to pages that no longer exist. There is no mechanism to 
get "all"
                      pages from a web server. The whole system is in a constant 
process of
                      change. And after all, the whole thing is growing at phenomenal 
rates.
                      Building a search engine on top of that is not something you can 
do on a
                      Saturday afternoon. Given the sheer size, it would take months 
to search
                      through all the pages in order to answer a single query, even if 
we had
                      a means to get from server to server, get the pages from there, 
and
                      search them. But we don't even know how to do that, since we 
don't know
                      all the web servers.
                  </p>
  
                  <p>
                      That first problem was addressed by bookmark collections, which 
soon
                      became very popular. The most popular probably was Yahoo, which 
evolved
                      to one of the most popular pages in the web just a year after it 
emerged
                      from a college dorm room.
                      The second problem was how to get the information from all those 
pages
                      laying around. This is where a web crawler comes in.
                  </p>
  
                  <p>
                      Ok, those engineers said, we are not able to get a list of all 
the
                      pages. But almost every page contains links to other pages. We 
can save a
                      page, extract all the links, and load all of these pages these 
links
                      point to. If we start at a popular location which contains a lot 
of links,
                      like Yahoo for example, chances should be that we can get "all" 
pages
                      on the web.
                  </p>
  
                  <p>
                      A little more formal, the web can be seen as a directional 
graph, with
                      pages as nodes and links as edges between them. A web crawler, 
also
                      called "spider" or "fetcher", uses the graph structure of the 
web to get
                      documents in order to be able to index them. Since there is no 
"push"
                      mechanism for updating our index, we need to "pull" the 
information on
                      our own, by repeatedly crawling the web.
                  </p>
  
              </subsection>
  
              <subsection name="Implementation - the first attempt">
  
                  <p>
                      "Easy", you may think now, "just implement what he said in the
                      paragraph before." So you start getting a page, extracting the 
links, following
                      all the pages you have not already visited. In Perl that can be 
done in
                      a few lines of code.
                  </p>
  
                  <p>
                      But then, very soon (I can tell you), you end up in a lot of 
problems:
                  </p>
  
                  <ul>
                      <li>a server doesn't respond. Your program always wait for it to 
time
                          out
                      </li><li>you get OutOfMemory errors soon after the beginning
                      </li><li>your hard drive fills up
                      </li><li>You notice that one page is loaded again time after 
time,
                          because the URL changed a little
                      </li><li>Some servers will behave very strange. They will 
respond after
                          30 seconds, sometimes they time out, sometimes they are not 
accessible
                          at all
                      </li><li>some URLs will get longer and longer. Suddenly you will 
get
                          URLs with a length of thousands of characters.
                      </li><li>But the main problem will be: you notice that your 
network
                          interface card (NIC) is waiting, and your CPU is waiting. 
What's going on?
                          The overall process will take days
                      </li>
                  </ul>
  
              </subsection>
  
  
              <subsection name="Features of the LARM crawler">
  
                  <p>
                      The LARM web crawler is a result of experiences with the errors 
as
                      mentioned above, connected with a lot of monitoring to get the 
maximum out
                      of the given system ressources. It  was designed with several 
different
                      aspects in mind:
                  </p>
  
                  <ul>
  
                      <li>Speed. This involves balancing the resources to prevent
                          bottlenecks. The crawler is multithreaded. A lot of work 
went in avoiding
                          synchronization between threads, i.e. by rewriting or 
replacing the standard
                          Java classes, which slows down multithreaded programs a lot
                      </li>
  
  
                      <li>Simplicity. The underlying scheme is quite modular and
                          comprehensible. See the description of the pipeline below
                      </li>
  
                      <li>Power. The modular design and the ease of the Java language 
makes
                          customisation simple
                      </li>
  
  
                      <li>Scalability. The crawler was supposed to be able to crawl 
<i>large
                              intranets</i> with hundreds of servers and hundreds of 
thousands of
                          documents within a reasonable amount of time. It was not 
ment to be
                          scalable to the whole Internet.</li>
  
                      <li>Java. Although there are many crawlers around at the time 
when I
                          started to think about it (in Summer 2000), I couldn't find 
a good
                          available implementation in Java. If this crawler would have 
to be integrated
                          in a Java search engine, a homogenous system would be an 
advantage. And
                          after all, I wanted to see if a fast implementation could be 
done in
                          this language.
                      </li>
  
                  </ul>
  
              </subsection>
  
              <subsection name="What the crawler can do for you, and what it cannot
                          (yet)">
  
                  <p>
                      What it can do for you:
                  </p>
  
                  <ul>
                      <li>Crawl a distinct set of the web, only restricted by a given 
regular
                          expression all pages have to match. The pages are saved into 
page files
                          of max. 50 MB (look at FetcherMain.java for details) and an 
index file
                          that contains the links between the URL and the position in 
the page
                          file. Links are logged as well. This is part of the standard 
LogStorage.
                          Other storages exist as well (see below)
                      </li>
  
                      <li>Crawling is done breadth first. Hosts are accessed in a 
round-robin
                          manner, to prevent the situation that all threads access one 
host at
                          once. However, at the moment there is no means to throttle 
access to a
                          server - the crawler works as fast as it can. There are also 
some
                          problems with this technique, as will be described below.
                      </li>
  
                      <li>The main part of the crawler is implemented as a pool of 
concurrent
                          threads, which speeds up I/O access
                      </li>
  
                      <li>The HTML link extractor has been optimised for speed. It was 
made
                          10 x faster than a generic SAX parser implementation
                      </li>
  
                      <li>A lot of logging and monitoring is done, to be able to track 
down
                          the going-ons in the inside
                      </li>
  
                      <li>A lot of parts of the crawler have already been optimised to
                          consume not more  memory then needed. A lot of the internal 
queues are cached
                          on hard drive, for example. Only the HashMap of already 
crawled pages
                          and the HostInfo structures still completely remain in 
memory, thus
                          limiting the number of crawled hosts and the number of 
crawled pages. At
                          the moment, OutOfMemory errors are not prevented, so beware.
  
                      </li><li>URLs are passed through a pipeline of filters that 
limit, for
                          example, the length of a URL, load robots.txt the first time 
a host is
                          accessed, etc. This pipeline can be extended easily by 
adding a Java
                          class to the pipeline.
                      </li><li>The storage mechanism is also pluggable. One of the next
                          issues would be to include this storage mechanism into the 
pipeline, to
                          allow a seperation of logging, processing, and storage
                      </li>
                  </ul>
  
                  <p>
                      On the other hand, at the time of this writing, the crawler has 
not yet
                      evolved into a production release. The reason is: until now, it 
just
                      served me alone.</p>
                  <p>
  
                      These issues remain:
                  </p>
  
                  <ul>
                      <li>
                          Still, there's a relatively high memory overhead <i>per 
server</i> and
                          also some overhead <i>per page</i>, especially since a map 
of already
                          crawled pages is held in memory at this time. Although some 
of the
                          in-memory structures are already put on disc, memory 
consumption is still
                          linear to the number of pages crawled. We have lots of ideas 
on how to
                          move that out, but since then, as an example with 500 MB 
RAM, the crawler
                          scales up to some 100.000 files on some 100s of hosts.
  
                      </li><li>It's not polite. It sucks out the servers, which can 
impose
                          DOS (Denial of Service) problems. We have plans to restrict 
concurrent
                          server accesses to only one, but that's not yet implemented.
  
                      </li><li>Only some of the configuration can be done with command 
line
                          parameters. The pipeline is put together in the startup 
procedure.
                          Configuration will be done from within an XML file in the 
future. We are
                          planning to use the Avalon framework for that.
  
                      </li><li>The ThreadMonitor is very experimental. It has evolved 
from a
                          pure monitoring mechanism to a central part of the whole 
crawler. It
                          should probably be refactored.
  
                      </li><li>Speed could still be optimized. Synchronization takes 
place
                          too often. At the moment URLs and documents are passed one 
after each
                          other. Should be changed such that groups of messages are 
passed in one
                          turn.
  
                      </li><li>The Lucene integration is pretty experimental, and also 
pretty
                          slow. It forms a major bottleneck if you use it.
  
                      </li><li>No processing whatsoever is done on the documents 
(except
                          extracting the links). It should be decided how much of this 
is supposed to
                          be done within the crawler, and what should be done in a post
                          processing step.
  
                      </li><li>Unix is the favoured operating system. I used a SUSE 
Linux
                          with 2.2 kernel. I remember that I ran into problems with 
the I/O routines
                          on Windows machines. I haven't tried it for a long time now, 
though.
  
                      </li><li>Only http is supported, no file server crawling with 
recurse
                          directory options, etc.
  
                      </li><li>
                          I noticed on high bandwidth (100 MBit/s) systems that very 
slowly
                          system sockets were eaten, although the Java code seemed to 
be ok.
  
                      </li>
                  </ul>
              </subsection>
  
              <subsection name="Syntax and runtime behaviour">
  
                  <p>
                      <a name="syntax"></a>
  
                      The command line options are very simple:
                  </p>
  
                  <source><![CDATA[
                      java [-server] [-Xmx[ZZ]mb] -classpath fetcher.jar
                      de.lanlab.larm.fetcher.FetcherMain
                      [-start STARTURL | @STARTURLFILE]+
                      -restrictto REGEX
                      [-threads[=10]]
                      -hostresolver HOSTRES.PROPERTIES
                      ]]></source>
  
                  <ul>
  
                      <li><b>-start</b> a start URL or, alternatively, a file with 
start URLs
                          (one in each line). In the latter case the file name must be 
preceded
                          by '@'. It must be a valid http-URL, including the http 
prefix </li>
                      <li><b>-restrictto</b>    a (Perl5) regular expression that all 
URLs have
                          to obey. These regexes are performed on the normalized 
version of a URL
                          (see below). If you are not familiar with regular 
expressions, see
                          <a 
href="http://www.perldoc.com/perl5.6.1/pod/perlre.html";>The Perl Man
                              Pages</a>.</li>
  
                      <li><b>-threads</b>       the number of concurrent threads that 
crawl the
                          pages. At this time, more than 25 threads don't provide any 
advantages
                          because synchronization effects and (probably) the overhead 
of the
                          scheduler slow the system down</li>
                  </ul>
  
                  <p>
                      <b>Java runtime options:</b>
                  </p>
  
                  <ul>
                      <li><b>-server</b>        starts the hot spot VM in server mode, 
which starts
                          up a little slower, but is faster during the run. 
Recommended</li>
                      <li><b>-Xmx&lt;ZZ&gt;mb</b>       sets the maximum size of the 
heap to
                          &lt;ZZ&gt; mb. Should be a lot. Set it to what you have.</li>
                  </ul>
  
                  <p>
                      You may also want to have a look at the source code, because some
                      options cannot be dealt with from the outside at this 
time.<br/><br/>
                  </p>
  
                  <p>
                      <b>Other options</b>
                  </p>
  
                  <p>
                      Unfortunately, a lot of the options are still not configurable 
from the
                      outside. Most of them are configured from within 
FetcherMain.java.
                      However, others are still spread over some of the other classes. 
At this
                      time, we tried to put a "FIXME" comment around all these 
options, so
                      check out the source code. </p>
  
                  <p>
                      <b>What happens now?</b>
                  </p>
  
                  <ol>
  
                      <li>The filter pipeline is built. The ScopeFilter is initialised 
with
                          the expression given by restrictto</li>
                      <!--zz-->
  
                      <li>The URL is put into the pipeline</li>
  
                      <li>The documents are fetched. If the mime type is text/html, 
links are
                          extracted and put back into the queue. The documents and 
URLs are
                          forwarded to the storage, which saves them</li>
                      <!--zz-->
  
                      <li>Meanwhile, every 5 seconds, the ThreadMonitor gathers 
statistics,
                          flushes log files, starts the garbage collection, and stops 
the fetcher
                          when everything seems to be done: all threads are idle, and 
nothing is
                          remaining in the queues</li>
  
                  </ol>
                  <!--zz-->
              </subsection>
              <!--zz !! -->
  
              <subsection name="Normalized URLs">
  
                  <p>
                      URLs are only supposed to make a web resource accessible. 
Unfortunately
                      there can be more than one representation of a URL, which can 
cause a
                      web crawler to save one file twice under different URLs. Two 
mechanisms
                      are used to reduce this error, while at the same time trying to 
keep
                      the second error (two URLs are regarded as pointing to the same 
file, but
                      are in fact two different ones):</p>
  
                  <ul>
                      <li>Common URL patterns are reduced to a known "canonical" form. 
I.
                          e. The URLs <i>http://host</i> and <i>http://host/</i> point 
to the same
                          file. The latter is defined as the canonical form of a root 
URL. For a
                          complete listing of these actions, see below</li>
                      <li>Host names can be edited through a set of rules manually 
passed
                          to the <i>HostResolver</i>, which can be configured in a 
configuration
                          file.</li>
                  </ul>
  
                  <p>
                      The result is used like a <i>stemming</i> function in IR 
systems: The
                      normalized form of a URL is used internally for comparisons, but 
to the
                      outside (i.e. for accessing the file), the original form is 
applied.
                  </p>
  
              </subsection>
  
              <subsection name="URL patterns">
  
                  <p>
                      These are the patterns applied by the <i>URLNormalizer</i>:
                  </p>
                  <ul>
                      <li>Host and path names are changed to lowercase. We know that on
                          Unix it may be possible to have two files in one directory 
that only
                          differ in their cases, but we don't think this will be used 
very often. On
                          the other hand, on Windows systems, case doesn't matter, so 
a lot of
                          Links use URLs with mixed cases</li>
                      <li>All characters except the ones marked 'safe' in the URL 
standard
                          are escaped. Spaces are always escaped as "%20" and not 
"+"</li>
                      <li>If Unicode characters are encountered, they are written as
                          escaped UTF-8 characters.</li>
                      <li>subsequent slashes '//' are removed. '/./' is reduced to 
'/'</li>
                      <li>index.* and default.* file names are removed</li>
                  </ul>
                  <p>Todo: '/../' is not handled yet</p>
                  <p>
                      Todo: Examples</p>
  
              </subsection>
  
              <subsection name="The Host Resolver">
  
                  <p>
                      The host resolver is also applied when the URL normalization 
takes
                      place. It knows three different types of rules:
                  </p>
  
                  <ul>
                      <li>startsWith: replace the start of a URL with something else 
if a
                          certain pattern is found at the start</li>
                      <li>endsWith: same with the end</li>
                      <li>synonym: treat a host name as a synonym to another</li>
                  </ul>
  
                  <p>
                      These three rules are applied to each URL in that order. I. e. 
you
                      can tell the host resolver to always remove the "www." of each 
host,
                      therefore regarding "cnn.com" and "www.cnn.com" as equal, by 
defining the
                      rule <i>setStartsWith("www.","")</i>
                  </p>
  
                  <p>
                      <b>Configuring HostResolver in a config file</b>
                  </p>
  
                  <p>
                      The HostResolver was one test on how config files could look 
like if
                      they were implemented using standard Java procedures. We used the
                      Jakarta BeanUtils for these matters (see
                      <a href="http://jakarta.apache.org/commons/beanutils.html";>The 
BeanUtils
                          website</a> for details) and implemented the HostResolver as 
a JavaBean. The
                      rules can then be stated as "mapped properties" in a
                      hostResolver.properties file (see the start syntax above). The 
only difference between
                      normal properties and the mapped properties as supported by 
BeanUtils is
                      that a second argument can be passed to the latter.
                  </p>
  
                  <p>
                      An example of such a properties file would look like this:
                  </p>
  
                  <source><![CDATA[
                      #hostResolver.properties
                      #format:
                      # startsWith(part1) = part2
                      #  if host starts with part1, this part will be replaced by part2
                      # endsWith(part1) = part2
                      #  if host ends with part1, this part will be replaced by part2.
                      #  This is done after startsWith was processed
                      # synonym(host1) = host2
                      # the keywords startsWith, endsWith and synonym are case 
sensitive
                      #  host1 will be replaced with host2. this is done _after_ 
startsWith
                      #  and endsWith was processed
  
                      startsWith(www.)=
                      startsWith(www2.)=
                      startsWith(w3.)=
                      endsWith(.somehost.com)=.somhost2.com
                      synonym(daedalus.apache.org)=apache.org
  
                      ]]></source>
  
                  <p>
                      As you can see, the file format itself is like the standard Java
                      <a 
href="http://java.sun.com/j2se/1.3/docs/api/java/util/Properties.html";>Properties</a>
                      format (comments etc.). Keywords are case sensitive.
                  </p>
  
  
                  <p>
                      At the time when the class was written, BeanUtils still had a bug
                      that dots "." were not supported in mapped properties indexes. 
As with
                      the new version (1.5 at the time of this writing) this is 
supposed to be
                      removed, but I have not tried yet. Therefore, the comma "," was 
made a
                      synonymon for dots. Since "," is not allowed in domain names, 
you can
                      still use (and even mix) them if you want, or if you only have 
an older
                      BeanUtils version available.
                  </p>
  
              </subsection>
  
              <section name="Lucene Integration">
  
                  <p>
                      LARM currently provides a very simple LuceneStorage that allows 
for
                      integrating the crawler with Lucene. It's ment to be a working 
example on
                      how this can be accomplished, not a final implementation. If you 
like
                      to volunteer on that part, contributions are welcome.</p>
  
                  <p>
                      The current storage simply takes the input that comes from the 
crawler
                      (a <i>WebDocument</i> which mainly consists of name/value pairs 
with
                      the document's contents) and puts it into a Lucene index. Each 
name/value
                      pair is put into one field. There's currently no incremental or 
update
                      operation, or document caching via a RAMDirectory. Therefore the
                      LuceneStorage becomes a bottleneck even with slow network 
connections.</p>
  
                  <p>
                      see storage/LuceneStorage.java and fetcher/FetcherMain.java for
                      details</p>
  
              </section>
  
  
          </section>
  
          <section name="Architecture">
  
              <p>I studied the <a 
href="http://research.compaq.com/SRC/mercator/research.html";>Mercator</a> web
                  crawler but decided to implement a somewhat different architecture.
                  Here is a high level overview of the default configuration:
              </p>
  
              <img src="/images/larm_architecture.jpg" width="568" height="460"
                   border="1" alt="Architecture Overview"/>
  
              <p>
                  The message handler is an implementation of a simple <i>chain of
                      responsibility</i>. Implementations of <i>Message</i> are passed 
down a
                  filter chain. Each of the filters can decide whether to send the 
message
                  along, change it, or even delete it. In this case, Messages of type
                  URLMessage are used. The message handler runs in its own thread. 
Thus, a call
                  of <i>putMessage()</i> or <i>putMessages()</i> resp. involve a
                  <i>producer-consumer</i>-like message transfer. The filters 
themselves run
                  within the message handler thread.
                  At the end of the pipeline the Fetcher distributes the incoming
                  messages to its worker threads. They are implemented as a <i>thread 
pool</i>:
                  Several <i>ServerThreads</i> are running concurrently and wait for
                  <i>Tasks</i> which include the procedure to be executed. If more 
tasks are
                  to be done than threads are available, they are kept in a queue, 
which
                  will be read whenever a task is finished.</p>
  
              <p>
                  At this point the pipeline pattern is left . The <i>FetcherTask</i>
                  itself is still quite monolithic. It gets the document, parses it if
                  possible, and stores it into a storage. In the future one might 
think of
                  additional configurable processing steps within another processing
                  pipeline. I thought about incorporating it into the filter pipeline, 
but since
                  the filters are passive components and the <i>FetcherThreads</i> are
                  active, this didn't work.
              </p>
  
              <subsection name="Performance">
  
                  <p>
                      The performance was improved about 10-15 times compared to the 
first
                      naive attempts with a pre-built parser and Sun's network 
classes. And
                      there is still room left. On a network with about 150 web 
servers, which
                      the crawler server was connected to by a 100 MBit FDDS 
connection, I was
                      able to crawl an average of 60 documents per second, or 3,7 MB, 
after
                      10 minutes in the startup period. In this first period, crawling 
is
                      slower because the number of servers is small, so the server 
output limits
                      crawling. There may also be servers that don't respond. They are
                      excluded from the crawl after a few attempts.
                  </p>
  
                  <p>
                      Overall, performance is affected by a lot of factors: The 
operating
                      system, the native interface, the Java libraries, the web 
servers, the
                      number of threads, whether dynamic pages are included in the 
crawl, etc.
                  </p>
  
                  <p>
                      From a development side, the speed is affected by the balance 
between
                      I/O and CPU usage. Both has to be kept at 100%, otherwise one of 
them
                      becomes the bottleneck. Managing these resources is the central 
part of a
                      crawler.
                      Imagine that only one thread is crawling. This is the worst 
case, as
                      can be seen very fast:</p>
  
                  <br/>
                  <br/>
                  <img src="/images/larm_crawling-process.jpg" width="600" height="306"
                       border="1" alt="Crawling Process"/>
  
                  <p>
                      The diagram to the right resembles a UML sequence diagram, 
except that
                      it stresses the time that a message needs to traverse the 
network.</p>
  
                  <ol>
                      <li>The URL is processed somehow. That's the filter part as 
stated
                          above</li>
                      <li>The request is sent. It goes through the different network 
layers
                          of the crawler server. A TCP/IP connection is established. 
Several
                          packets are sent back and forth. Then the crawler waits 
until the web server
                          processes the request, looks up the file or renders the page 
(which can
                          take several seconds or even minutes), then sends the file 
to the
                          crawler.</li>
                      <li>The crawler receives packet after packet, combines them to a 
file.
                          Probably it is copied through several buffers until it is 
complete.
                          This will take some CPU time, but mostly it will wait for 
the next packet
                          to arrive. The network transfer by itself is also affected 
by a lot of
                          factors, i.e. the speed of the web server, acknowledgement 
messages,
                          resent packages etc. so 100% network utilization will almost 
never be
                          reached.</li>
                      <li>The document is processed, which will take up the whole CPU. 
The
                          network will be idle at that time.</li>
                  </ol>
  
                  <p>
                      The storage process, which by itself uses CPU and disk I/O 
resources,
                      was left out here. That process will be very similar, although 
the
                      traversal will be faster.
                  </p>
  
                  <p>
                      As you can see, both CPU and I/O are not used most of the time, 
and
                      wait for the other one (or the network) to complete. This is the 
reason
                      why single threaded web crawlers tend to be very slow (wget for 
example).
                      The slowest component always becomes the bottleneck.</p>
  
                  <p>
                      Two strategies can be followed to improve this situation:</p>
  
                  <ol>
                      <li>use asynchronous I/O</li>
                      <li>use several threads</li>
                  </ol>
  
                  <p>
                      Asynchronous I/O means, I/O requests are sent, but then the 
crawler
                      continues to process documents it has already crawled.</p>
  
                  <p>
                      Actually I haven't seen an implementation of this technique.
                      Asynchronous I/O was not available in Java until version 1.4. An 
advantage would
                      be that thread handling is also an expensive process in terms of 
CPU
                      and memory usage. Threads are resources and, thus, limited. I 
heard that
                      application server developers wanted asynchronous I/O, to be 
able to
                      cope with hundreds of simultaneous requests without spawning 
extra
                      threads for each of them. Probably this can be a solution in the 
future. But
                      from what I know about it today, it will not be necessary </p>
  
                  <p>
                      The way this problem is solved usually in Java is with the use of
                      several threads. If many threads are used, chances are good that 
at any
                      given moment, at least one thread is in one of the states above, 
which
                      means both CPU and I/O will be at a maximum.</p>
                  <p>
                      The problem with this is that multi threaded programming is 
considered
                      to be one of the most difficult areas in computer science. But 
given
                      the simple linear structure of web crawlers, it is not very hard 
to avoid
                      race conditions or dead lock problems. You always get into 
problems
                      when threads are supposed to access shared resources, though. 
Don't touch
                      this until you have read the standard literature and have made 
at least
                      10 mistakes (and solved them!).</p>
                  <p>
                      Multithreading doesn't come without a cost, however. First, 
there is
                      the cost of thread scheduling. I don't have numbers for that in 
Java, but
                      I suppose that this should not be very expensive. MutExes can 
affect
                      the whole program a lot . I noticed that they should be avoided 
like
                      hell. In a crawler, a MutEx is used, for example, when a new URL 
is passed
                      to the thread, or when the fetched documents are supposed to be 
stored
                      linearly, one after the other.</p>
                  <p>
                      For example, the tasks used to insert a new URL into the global 
message
                      handler each time when a new URL was found in the document. I 
was able
                      to speed it up considerably when I changed this so that the URLs 
are
                      collected locally and then inserted only once per document. 
Probably this
                      can be augmented even further if each task is comprised of 
several
                      documents which are fetched one after the other and then stored
                      together.</p>
                  <p>
                      Nonetheless, keeping the right balance between the two resources 
is a
                      big concern. At the moment, the number of threads and the number 
of
                      processing steps is static, and is only optimised by trial and 
error. Few
                      hosts, slow network -> few threads. slow CPU -> few processing 
steps.
                      many hosts, fast network -> many threads. Probably those 
heuristics will
                      do well, but I wonder if these figures could also be fine-tuned
                      dynamically during runtime.</p>
                  <p>
                      Another issue that was optimised were very fine-grained method 
calls.
                      For example, the original implementation of the HTML parser used 
to call
                      the read()-method for each character. This call had probably to
                      traverse several Decorators until it got to a - synchronized 
call. That's why
                      the CharArrayReader was replaced by a SimpleCharArrayReader, 
because
                      only one thread works on a document at a time.</p>
                  <p>
                      These issues can only be traced down with special tools, i.e.
                      profilers. The work is worth it, because it allows one to work 
on the 20% of the
                      code that costs 80% of the time.
                  </p>
  
              </subsection>
  
              <subsection name="Memory Usage">
  
                  <p>
                      One "web crawler law" could be defined as:</p>
  
                  <source>
                      What can get infinite, will get infinite.
                      What can go wrong, will go wrong.
                      Eventually.
                      Very soon.
                  </source>
  
                  <p>
                      A major task during the development was to get memory usage low. 
But a
                      lot of work still needs to be done here. Most of the 
optimizations
                      incorporated now move the problem from main memory to the hard 
disk, which
                      doesn't solve the problem.<br/>
                  </p>
  
                  <p>
                      Here are some means that were used:
                  </p>
  
                  <ul>
  
                      <li>CachingQueues: The message queue, the Fetcher queue, the 
robot
                          exclusion queue (see below) - a lot of queues can fill up 
the whole main
                          memory in a very short period of time. But since queues are 
only accessed
                          at their ends, a very simple mechanism was implemented to 
keep memory
                          usage low: The queue was divided into blocks of fixed size. 
Only the two
                          blocks at its end are kept in RAM. The rest is serialized on 
disk. In
                          the end, only a list of block references has to be managed
                      </li>
  
                      <li>Define a maximum value for everything, and keep an eye on it.
                          Downloaded files can get "infinitely" large. URLs can get 
infinitely long.
                          Servers may contain an infinite set of documents. A lot of 
these checks
                          had to be included even in the university network mentioned. 
A special
                          case were the URLs. Some .shtml pages on a web server 
pointed to a
                          subdirectory that didn't exist but revealed the same page. 
If these errors
                          are introduced at will, they are called crawler traps: An 
infinite URL
                          space. The only way of dealing with this is manually 
excluding the
                          hosts.
                      </li>
  
                      <li>Optimized HTML parser. Current parsers tend to create a huge 
amount
                          of very small objects. Most of that work is unnecessary for 
the task to
                          be done. This can only be optimised by stripping down the 
parser to do
                          only what it is supposed to do in that special task.
                          However, there still remains a problem: The HashMap of 
already visited
                          URLs needs to be accessed randomly while reading and 
writing. I can
                          imagine only two ways to overcome this:
                      </li>
  
                      <li>Limiting, in some way, the number of URLs in RAM. If the 
crawler
                          were distributed, this could be done by assigning only a 
certain number
                          of hosts to each crawler node, while at the same time 
limiting the
                          number of pages read from one host. In the end this will 
only limit the
                          number of hosts that can be crawled by the number of crawler 
nodes
                          available. Another solution would be to store complete hosts 
on drive, together
                          with the list of unresolved URLs. Again, this shifts the 
problem only
                          from RAM to the hard drive
                      </li>
  
                      <li>Something worth while would be to compress the URLs. A lot 
of parts
                          of URLs are the same between hundreds of URLs (i.e. the host 
name). And
                          since only a limited number of characters are allowed in 
URLs, Huffman
                          compression will lead to a good compression rate . A first 
attempt
                          would be to incorporate the visited URLs hash into the 
HostInfo structure.
                          After all, the VisitedFilter hash map turned out to be the 
data
                          structure that will take up most of the RAM after some time.
                      </li>
  
                  </ul>
  
              </subsection>
  
              <subsection name="The Filters">
  
                  <p>
                      Most of the functionality of the different filters has already 
been
                      described. Here's another, more detailed view :
                  </p>
  
                  <p>
                      <b>RobotExclusionFilter</b>
                  </p>
  
                  <p>
                      The first implementation of this filter just kept a list of 
hosts, and
                      every time a new URLMessage with an unknown host came by, it 
attempted
                      to read the robots.txt file first to determine whether the URL 
should
                      be filtered.
                      A major drawback of that was that when the server was not 
accessible
                      somehow, the whole crawler was held until the connection timed 
out (well
                      with Sun's classes that even didn't happen, causing the whole 
program
                      to die).
                      The second implementation has its own little ThreadPool, and 
keeps a
                      state machine of each host in the HostInfo structure.
                  </p>
  
                  <p>
                      If the host manager doesn't contain a HostInfo structure at all, 
the
                      filter creates it and creates a task to get the robots.txt file. 
During
                      this time, the host state is set to "isLoadingRobotsTxt", which 
means
                      further requests to that host are put into a queue. When loading 
is
                      finished, these URLs (and all subsequent ones) are put back to 
the beginning
                      of the queue.
                  </p>
  
                  <p>
                      After this initial step, every URL that enters the filter is 
compared
                      to the disallow rules set (if present), and is filtered if 
necessary.
                  </p>
  
                  <p>
                      Since the URLs are put back to the beginning of the queue, the 
filter
                      has to be put in front of the VisitedFilter.
                  </p>
  
                  <p>
                      In the host info structure, which is also used by the 
FetcherTasks,
                      some information about the health of the hosts is stored as 
well. If the
                      server is in a bad state several times, it is excluded from the 
crawl.
                      Note that it is possible that a server will be accessed more 
than the
                      (predefined) 5 times that it can time out, since a FetcherThread 
may
                      already have started to get a document when another one marks it 
as bad.
                  </p>
  
                  <p>
                      <b>URLLengthFilter</b>
                  </p>
  
                  <p>
                      This very simple filter just filters a URL if a certain (total) 
length
                      is exceeded
                  </p>
  
                  <p>
                      <b>KnownPathsFilter</b>
                  </p>
  
                  <p>
                      This one filters some very common URLs (i.e. different views of 
an
                      Apache directory index), or hosts known to make problems. Should 
be more
                      configurable from outside in the future
                  </p>
  
                  <p>
                      <b>URLScopeFilter</b>
                  </p>
  
                  <p>
                      The scope filter filters a URL if it doesn't match a given 
regular
                      expression.
                  </p>
  
                  <p>
                      <b>URLVisitedFilter</b>
                  </p>
  
                  <p>
                      This filter keeps a HashMap of already visited URLs, and filters 
out
                      what it already knows
                  </p>
  
                  <p>
                      <b>Fetcher</b>
                  </p>
  
                  <p>
                      The fetcher itself is also a filter that filters all URLs - they 
are
                      passed along to the storage as WebDocuments, in a different 
manner. It
                      contains a ThreadPool that runs in its own thread of control, 
which takes
                      tasks from the queue an distributes them to the different
                      FetcherThreads.
                  </p>
  
                  <p>
                      In the first implementation the fetcher would simply distribute 
the
                      incoming URLs to the threads. The thread pool would use a simple 
queue to
                      store the remaining tasks. But this can lead to a very "unpolite"
                      distribution of the tasks: Since � of the links in a page point 
to the same
                      server, and all links of a page are added to the message handler 
at
                      once, groups of successive tasks would all try to access the 
same server,
                      probably causing denial of service, while other hosts present in 
the
                      queue are not accessed.
                  </p>
  
                  <p>
                      To overcome this, the queue is divided into different  parts, 
one for
                      each host. Each host contains its own (caching) queue. But the 
methods
                      used to pull tasks from the "end" of this queue cycle through 
the hosts
                      and always get a URL from a different host.
                  </p>
  
                  <p>
                      One major problem still remains with this technique: If one host 
is
                      very slow, it can still slow down everything. Since with n host 
every nth
                      task will be accessed to this host, it can eat one thread after 
the
                      other if loading a document takes longer than loading it from 
the (n-1)
                      other servers. Then two concurrent requests will result on the 
same
                      server, which slows down the response times even more, and so 
on. In
                      reality, this will clog up the queue very fast. A little more 
work has to be
                      done to avoid these situations, i.e. by limiting the number of 
threads
                      that access one host at a time.
                  </p>
  
                  <p>
                      <b>A Note on DNS</b>
                  </p>
  
                  <p>
                      The Mercator crawler document stresses a lot on resolving host 
names.
                      Because of that, a DNSResolver filter was implemented in the 
very first
                      time. Two reasons prevented that it is used any more:
                  </p>
  
                  <ul>
  
                      <li>newer versions of the JDK than the one Mercator used resolve 
the IP
                          address of a host the first time it is accessed, and keep a 
cache of
                          already resolved host names.</li>
  
                      <li>the crawler itself was designed to crawl large local 
networks, and
                          not the internet. Thus, the number of hosts is very 
limited.</li>
  
                  </ul>
  
              </subsection>
          </section>
  
  
          <section name="Future Enhancements">
  
              <subsection name="Politeness">
  
                  <p>
                      A crawler should not cause a Denial of Service attack. So this 
has to
                      be addressed.
                  </p>
  
              </subsection>
  
              <subsection name="The processing pipeline">
  
                  <p>
                      The FetcherTask, as already stated, is very monolithic at this 
time.
                      Probably some more processing should be done at this step (the 
problem
                      with balanced CPU/IO usage taken into account). At least 
different
                      handlers for different mime types should be provided, i.e. to 
extract links
                      from PDF documents. The Storage should also be broken up. I only 
used
                      the LogStorage within the last months, which now doesn't only 
writes to
                      log files, but also stored the files on disk. This should 
probably be
                      replaced by a storage chain where different stores could be 
appended.
                  </p>
  
              </subsection>
  
  
  
              <subsection name="LARM as a real server">
  
                  <p>
                      The only way to start a crawl today is starting the crawler from 
the
                      shell. But it could also remain idle and wait for commands from 
an RMI
                      connection or expose a Web Service. Monitoring could be done by 
a simple
                      included web server that provides current statistics via HTML
                  </p>
  
                  <ul>
                      <li><b>Recommended Reading: </b> We plan on using the
                      <a href="http://jakarta.apache.org/avalon";>Avalon</a> 
Framework</li> for state
                      management and configuration. A very early proposal can
                      be found
                      <a 
href="http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@;jakarta.apache.org&msgId=399399">here</a>.
                      We found out that Avalon provides a lot of the functionality 
described
                      as necessary in that posting.
                  </ul>
  
  
              </subsection>
  
              <subsection name="Distribution">
  
                  <p>
                      Distribution is a big issue. Some people say "Distribute your 
program
                      late. And then later." But as others have implemented distributed
                      crawlers, this should not be very hard.<br/>
                      I see two possible architectures for that:
                  </p>
  
                  <ul>
                      <li>Write a single dispatcher (a star network) that contains the 
whole
                          MessageHandler except the Fetcher itself. The crawlers are 
run as
                          servers (see above), and are configured with a URL source 
that gets their
                          input from the dispatcher and a MessageHandler that stores 
URLs back to
                          the dispatcher. The main drawback being that this can become 
a
                          bottleneck.
                      </li>
                      <li>Partition the domain to be crawled into several parts. This 
could
                          be done for example by dividing up different intervals of 
the hash value
                          of the host names. Then plugging in another crawler could be 
done
                          dynamically, even within a peer to peer network. Each node 
knows which node
                          is responsible for which interval, and sends all URLs to the 
right
                          node. This could even be implemented as a filter.
                      </li>
                  </ul>
  
                  <p>
                      One thing to keep in mind is that the number of URLs transferred 
to
                      other nodes should be as large as possible. </p>
  
                  <p>
                      The next thing to be distributed is the storage mechanism. Here, 
the
                      number of pure crawling nodes and the number of storing (post 
processing)
                      nodes could possibly diverge. An issue here is that the whole 
documents
                      have to be transferred over the net. </p>
  
  
                  <ul>
                      <li><b>Recommended Reading: </b> <i>Junghoo Cho and Hector
                              Garcia-Molina (2002).
                              <a 
href="http://citeseer.nj.nec.com/cho02parallel.html";>Parallel crawlers</a>.
                              In Proc. of the 11th International World--Wide Web 
Conference, 2002</i></li>
                  </ul>
  
              </subsection>
  
              <subsection name="URL Reordering">
  
                  <p>
                      One paper discussed different types of reordering URLs while 
crawling .
                      One of the most promising attempts was to take the calculated 
PageRank
                      into account . Crawling pages with higher PageRanks first seemed 
to get
                      important pages earlier. This is not rocket science, the 
research was
                      already done years ago.
                  </p>
  
                  <ul>
                      <li><b>Recommended Reading: </b> <i>J. Cho, H. Garcia-Molina 
&amp; L.
                              Page (1998),
                              <a
                                 
href="http://citeseer.nj.nec.com/30730.html";>Efficient Crawling Through URL 
Ordering</a>,
                              Seventh International Web Conference, Brisbane, 
Australia, 14-18 April</i></li>
                  </ul>
  
              </subsection>
  
              <subsection name="Recovery">
  
                  <p>
                      At the moment there is no way of stopping and restarting a 
crawl. There
                      should be a mechanism to move the current state of the crawler 
to disk,
                      and, in case of a failure, to recover and continue from the last 
saved
                      state.
                  </p>
  
              </subsection>
  
          </section>
  
      </body>
  
  </document>


--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@;jakarta.apache.org>

cvs commit: jakarta-lucene/xdocs/lucene-sandbox/larm overview.xml

Reply via email to