Great! Thank you Corrado!

----- Original Message -----
From: "zzcgiacomini" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, April 04, 2007 10:53 PM
Subject: Nutch Step by Step

Maybe someone will find this useful?
I have spent some time playing with Nutch 0.8 and collecting notes from the
mailing lists... Maybe someone will find these notes useful and can point
out my mistakes; I am not at all a Nutch expert.
-Corrado

--------------------------------------------------------------------------------
0) CREATE NUTCH USER AND GROUP

Create a nutch user and group and perform all the following steps logged in
as the nutch user. Put these lines in your .bash_profile:

   export JAVA_HOME=/opt/jdk
   export PATH=$JAVA_HOME/bin:$PATH

1) GET HADOOP AND NUTCH

Download the Nutch and Hadoop trunks as explained on
http://lucene.apache.org/hadoop/version_control.html

   svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk
   svn checkout http://svn.apache.org/repos/asf/lucene/hadoop/trunk

2) BUILD HADOOP

Build and produce the tar file:

   cd hadoop/trunk
   ant tar

To build Hadoop with 64-bit native libraries, proceed as follows:

A) Download and install the latest LZO library
   (http://www.oberhumer.com/opensource/lzo/download/)
   Note: the packages currently available for FC5 are too old.

   tar xvzf lzo-2.02.tar.gz
   cd lzo-2.02
   ./configure --prefix=/opt/lzo-2.02
   make install

B) Compile the native 64-bit libs for Hadoop, if needed:

   cd hadoop/trunk/src/native

   export LDFLAGS=-L/opt/jdk/jre/lib/amd64/server
   export JVM_DATA_MODEL=64

   CCFLAGS="-I/opt/lzo-2.02/include" CPPFLAGS="-I/opt/lzo-2.02/include" ./configure

   cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/
   cp src/org_apache_hadoop.h ./src/org/apache/hadoop/io/compress/lzo
   cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h
   cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h

   In config.h replace the line

      #define HADOOP_LZO_LIBRARY libnotfound.so

   with this one

      #define HADOOP_LZO_LIBRARY "liblzo2.so"

   make

3) BUILD NUTCH

The Nutch nightly trunk now comes with hadoop-0.12.jar, but you may want to
replace it with the latest nightly Hadoop jar you just built:

   mv nutch/trunk/lib/hadoop-0.12.jar nutch/trunk/lib/hadoop-0.12.jar.ori
   cp hadoop/trunk/build/hadoop-0.12.3-dev.jar nutch/trunk/lib/hadoop-0.12.jar
   cd nutch/trunk
   ant tar

4) INSTALL

Copy and untar the generated .tar.gz file on the machines that will
participate in the engine activities. In my case I only have two identical
machines available, called myhost1 and myhost2.

On each of them I have installed the Nutch binaries under /opt/nutch, while
I have decided to keep the Hadoop distributed filesystem in a directory
called hadoopFs, located on a large disk mounted on /disk10.

On both machines create the directory:

   mkdir /disk10/hadoopFs/

Copy the Hadoop 64-bit native libraries, if needed:

   mkdir /opt/nutch/lib/native/Linux-x86_64
   cp -fl hadoop/trunk/src/native/lib/.libs/* /opt/nutch/lib/native/Linux-x86_64

5) CONFIG

I will use myhost1 as the master machine, running the namenode and
jobtracker daemons; it will also run a datanode and a tasktracker.
myhost2 will only run a datanode and a tasktracker.

A) On both machines change the conf/hadoop-site.xml configuration file
   (a sketch of the file is shown just below; the full list of values I
   used follows it).
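A minimal sketch of what conf/hadoop-site.xml might look like, assuming the
hostnames and directory layout used in this setup. Only a few properties
are spelled out; the remaining ones from the list below are added in
exactly the same way:

# Hedged sketch: write conf/hadoop-site.xml on both machines.
cat > /opt/nutch/conf/hadoop-site.xml <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
  <name>fs.default.name</name>
  <value>myhost1.mydomain.org:9010</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>myhost1.mydomain.org:9011</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

<!-- add mapred.map.tasks, mapred.reduce.tasks, dfs.name.dir,
     dfs.data.dir, mapred.system.dir and mapred.local.dir here,
     with the values listed below -->

</configuration>
EOF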
Here are the values I have used:

   fs.default.name     : myhost1.mydomain.org:9010
   mapred.job.tracker  : myhost1.mydomain.org:9011
   mapred.map.tasks    : 40
   mapred.reduce.tasks : 3
   dfs.name.dir        : /opt/hadoopFs/name
   dfs.data.dir        : /opt/hadoopFs/data
   mapred.system.dir   : /opt/hadoopFs/mapreduce/system
   mapred.local.dir    : /opt/hadoopFs/mapreduce/local
   dfs.replication     : 2

   (These directories should correspond to the hadoopFs directory created
   in step 4; make sure the paths are consistent on both machines, for
   example via a symlink.)

   "The mapred.map.tasks property tells how many tasks you want to run in
   parallel. This should be a multiple of the number of computers that you
   have. In our case, since we are starting out with 2 computers, we will
   have 4 map and 4 reduce tasks."
   (Note that above I actually used 40 map tasks and 3 reduce tasks.)

   "The dfs.replication property states how many servers a single file
   should be replicated to before it becomes available. Because we are
   using 2 servers I have set this to 2."

   You may also want to change nutch-site.xml, setting http.redirect.max
   to a value different from the default of 3:

   http.redirect.max : 10

B) Be sure that your conf/slaves file contains the names of the slave
   machines. In my case:

   myhost1.mydomain.org
   myhost2.mydomain.org

C) Create directories for pid and log files on both machines:

   mkdir /opt/nutch/pids
   mkdir /opt/nutch/logs

D) On both machines change the conf/hadoop-env.sh file to point to the
   right Java and Nutch installation:

   export HADOOP_HOME=/opt/nutch
   export JAVA_HOME=/opt/jdk
   export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
   export HADOOP_PID_DIR=${HADOOP_HOME}/pids

E) Because of a classloader problem in Nutch, the following lines need to
   be added to the nutch/bin/hadoop script before it starts building the
   CLASSPATH variable:

   for f in $HADOOP_HOME/nutch-*.jar; do
     CLASSPATH=${CLASSPATH}:$f;
   done

   This will put the nutch-*.jar file into the CLASSPATH.

6) SSH SETUP (Important!)

Set up ssh as explained in http://wiki.apache.org/nutch/NutchHadoopTutorial
and test the ability to log in without a password on each host itself and
from myhost1 to myhost2 and vice versa. This is a very important step to
avoid connection-refused problems between the daemons.

Here is a short example of how to proceed:

A) Use ssh-keygen to create the .ssh/id_dsa files:

   ssh-keygen -t dsa
   Generating public/private dsa key pair.
   Enter file in which to save the key (/home/nutch/.ssh/id_dsa):
   Enter passphrase (empty for no passphrase):
   Enter same passphrase again:
   Your identification has been saved in /home/nutch/.ssh/id_dsa.
   Your public key has been saved in /home/nutch/.ssh/id_dsa.pub.
   The key fingerprint is:
   01:36:6c:9d:27:09:54:e4:ff:fb:20:86:8c:e1:6c:82 [EMAIL PROTECTED]

B) Copy .ssh/id_dsa.pub to all machines as .ssh/authorized_keys.

C) On each machine configure ssh-agent to start at login, either by adding
   a line to .xsession (e.g. "ssh-agent startkde") or by putting
   eval `ssh-agent` in .bashrc (this will start an ssh-agent for every new
   shell).

D) Use ssh-add to add the DSA key:

   ssh-add
   Enter passphrase for /home/nutch/.ssh/id_dsa:
   Identity added: /home/nutch/.ssh/id_dsa (/home/nutch/.ssh/id_dsa)
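To verify the ssh setup before going on, a minimal check (assuming the
hostnames used in this setup) is to make sure each of these commands
returns immediately without asking for a password or passphrase:

   # run as the nutch user on myhost1
   ssh myhost1.mydomain.org hostname
   ssh myhost2.mydomain.org hostname

   # and the same from myhost2
   ssh myhost1.mydomain.org hostname
   ssh myhost2.mydomain.org hostname

If any of them prompts for a password, the daemons started through the
start-all.sh/stop-all.sh scripts later on will have problems reaching the
other node.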
7) FORMAT HADOOP FILESYSTEM

"Fix for HADOOP-19. A namenode must now be formatted before it may be used.
Attempts to start a namenode in an unformatted directory will fail, rather
than automatically creating a new, empty filesystem, causing existing
datanodes to delete all blocks. Thus a mis-configured dfs.data.dir should
no longer cause data loss."

On the master machine (myhost1) run these commands:

   cd /opt/nutch/
   bin/hadoop namenode -format

This will create the /opt/hadoopFs/name/image directory.

8) START NAMENODE

Start the namenode on the master machine (myhost1):

   bin/hadoop-daemon.sh start namenode

   starting namenode, logging to /opt/nutch/logs/hadoop-nutch-namenode-myhost1.mydomain.org.out
   060509 150431 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 150431 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 150431 directing logs to directory /opt/nutch/logs

9) START DATANODES

Start the datanode on the master and all slave machines (myhost1 and
myhost2).

On myhost1:

   bin/hadoop-daemon.sh start datanode

   starting datanode, logging to /opt/nutch/logs/hadoop-nutch-datanode-myhost1.mydomain.org.out
   060509 150619 0x0000000a parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 150619 0x0000000a parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 150619 0x0000000a directing logs to directory /opt/nutch/logs

On myhost2:

   bin/hadoop-daemon.sh start datanode

   starting datanode, logging to /opt/nutch/logs/hadoop-nutch-datanode-myhost2.mydomain.org.out
   060509 151517 0x0000000a parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 151517 0x0000000a parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 151517 0x0000000a directing logs to directory /opt/nutch/logs

10) START JOBTRACKER

Start the jobtracker on the master machine (myhost1).

On myhost1:

   bin/hadoop-daemon.sh start jobtracker

   starting jobtracker, logging to /opt/nutch/logs/hadoop-nutch-jobtracker-myhost1.mydomain.org.out
   060509 152020 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 152021 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 152021 directing logs to directory /opt/nutch/logs

11) START TASKTRACKERS

Start the tasktracker on the slave machines (myhost1 and myhost2).

On myhost1:

   bin/hadoop-daemon.sh start tasktracker

   starting tasktracker, logging to /opt/nutch/logs/hadoop-nutch-tasktracker-myhost1.mydomain.org.out
   060509 152236 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 152236 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060509 152236 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 152236 directing logs to directory /opt/nutch/logs

On myhost2:

   bin/hadoop-daemon.sh start tasktracker

   starting tasktracker, logging to /opt/nutch/logs/hadoop-nutch-tasktracker-myhost2.mydomain.org.out
   060509 152333 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 152333 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060509 152333 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 152333 directing logs to directory /opt/nutch/logs

NOTE: Now that we have verified that the daemons start and connect
properly, we can start and stop all of them with the start-all.sh and
stop-all.sh scripts from the master machine, as shown below.
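A minimal sketch, assuming the standard Hadoop helper scripts shipped in
/opt/nutch/bin and the conf/slaves file set up in step 5B:

   # Run as the nutch user on the master (myhost1).
   cd /opt/nutch
   bin/stop-all.sh    # stops tasktrackers, datanodes, jobtracker, namenode on all hosts
   bin/start-all.sh   # starts them all again, using conf/slaves to reach the slaves over ssh

The per-daemon hadoop-daemon.sh commands above remain useful when you want
to restart a single daemon on a single machine.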
12) TEST FUNCTIONALITY

Test the Hadoop functionality... just a simple ls:

   bin/hadoop dfs -ls

   060509 152844 parsing jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
   060509 152845 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
   060509 152845 No FS indicated, using default:localhost:9010
   060509 152845 Client connection to 127.0.0.1:9010: starting
   Found 0 items

The DFS filesystem is empty... of course.

13) CREATE FILE FOR URL INJECTION

Now we need to create a crawldb and inject URLs into it. These initial URLs
will then be used for the initial crawling. Let's inject URLs from the DMOZ
Open Directory. First we must download and uncompress the file listing all
of the DMOZ pages. (This is an approximately 300MB compressed file, about
2GB uncompressed, so this will take a few minutes.)

On the myhost1 machine, where we run the namenode:

   cd /disk10
   wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
   gunzip content.rdf.u8.gz
   mkdir dmoz

A) All 5 million pages. DMOZ contains around 5 million URLs:

   /opt/nutch-0.8-dev/bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls

   060510 104615 parsing jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-default.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-site.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
   060510 104615 skew = -2131431075
   060510 104615 Begin parse
   060510 104616 Client connection to myhost1:9010: starting
   060510 105156 Completed parse. Found 4756391 pages.

B) As a second choice, we can select a random subset of these pages. (We
   use a random subset so that everyone who runs this tutorial doesn't
   hammer the same sites.) DMOZ contains around five million URLs; with
   -subset 100 we select one out of every 100, so that we end up with
   around 50,000 URLs:

   bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 100 > dmoz/urls

   060510 104615 parsing jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-default.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-site.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
   060510 104615 skew = -736060357
   060510 104615 Begin parse
   060510 104615 Client connection to myhost1:9010: starting
   060510 104615 Completed parse. Found 49498 pages.

Here I go with choice B.

The parser also takes a few minutes, as it must parse the full 2GB file.
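Before copying the list into the DFS, it can be worth a quick sanity check
on the local output file; a minimal sketch using plain coreutils (nothing
Nutch-specific), assuming the paths above:

   wc -l /disk10/dmoz/urls     # should roughly match the "Found ... pages" count
   head -5 /disk10/dmoz/urls   # eyeball a few of the extracted URLs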
Finally, we initialize the crawl db with the selected URLs. First copy the
dmoz directory into the DFS:

   bin/hadoop dfs -put /disk10/dmoz dmoz

   060510 101321 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060510 101321 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060510 101321 No FS indicated, using default:myhost1.mydomain.org:9010
   060510 101321 Client connection to 10.234.57.38:9010: starting
   060510 101321 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060510 101321 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml

   bin/hadoop dfs -lsr dmoz

   060510 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060510 134738 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060510 134738 No FS indicated, using default:myhost1.mydomain.org:9010
   060510 134738 Client connection to 10.234.57.38:9010: starting
   /user/nutch/dmoz        <dir>
   /user/nutch/dmoz/urls   <r 2>   57059180

14) CREATE CRAWLDB (INJECT URLs)

Create a crawldb and inject the URLs into the web database:

   bin/nutch inject test/crawldb dmoz

   060511 092330 Injector: starting
   060511 092330 Injector: crawlDb: test/crawldb
   060511 092330 Injector: urlDir: dmoz
   060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 092330 Injector: Converting injected urls to crawl db entries.
   060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 092330 Client connection to 10.234.57.38:9010: starting
   060511 092330 Client connection to 10.234.57.38:9011: starting
   060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 092332 Running job: job_0001
   060511 092333 map 0% reduce 0%
   060511 092342 map 25% reduce 0%
   060511 092344 map 50% reduce 0%
   060511 092354 map 75% reduce 0%
   060511 092402 map 100% reduce 0%
   060511 092412 map 100% reduce 25%
   060511 092414 map 100% reduce 75%
   060511 092422 map 100% reduce 100%
   060511 092423 Job complete: job_0001
   060511 092423 Injector: Merging injected urls into crawl db.
   060511 092423 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 092423 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 092423 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 092424 Running job: job_0002
   060511 092425 map 0% reduce 0%
   060511 092442 map 25% reduce 0%
   060511 092444 map 50% reduce 0%
   060511 092454 map 75% reduce 0%
   060511 092502 map 100% reduce 0%
   060511 092511 map 100% reduce 25%
   060511 092513 map 100% reduce 75%
   060511 092522 map 100% reduce 100%
   060511 092523 Job complete: job_0002
   060511 092523 Injector: done

This will create the test/crawldb folder in the DFS.

From the Nutch tutorial:
   "The crawl database, or crawldb. This contains information about every
   url known to Nutch, including whether it was fetched, and, if so, when."

You can also see that the physical filesystem where we put the DFS has
changed: a few data block files have been created. This happens on both
myhost1 and myhost2, which participate in the DFS.

   tree /disk10/hadoopFs

   /disk10/hadoopFs
   |-- data
   |   |-- data
   |   |   |-- blk_-1388015236827939264
   |   |   |-- blk_-2961663541591843930
   |   |   |-- blk_-3901036791232325566
   |   |   |-- blk_-5212946459038293740
   |   |   |-- blk_-5301517582607663382
   |   |   |-- blk_-7397383874477738842
   |   |   |-- blk_-9055045635688102499
   |   |   |-- blk_-9056717903919576858
   |   |   |-- blk_1330666339588899715
   |   |   |-- blk_1868647544763144796
   |   |   |-- blk_3136516483028291673
   |   |   |-- blk_4297959992285923734
   |   |   |-- blk_5111098874834542511
   |   |   |-- blk_5224195282207865093
   |   |   |-- blk_5554003155307698150
   |   |   |-- blk_7122181909600991812
   |   |   |-- blk_8745902888438265091
   |   |   `-- blk_883778723937265061
   |   `-- tmp
   |-- mapreduce
   `-- name
       |-- edits
       `-- image
           `-- fsimage

To inspect the crawldb you can also dump it inside the DFS and copy the
dump out to the local filesystem:

   nutch readdb test/crawldb -dump tmp/crawldbDump1
   hadoop dfs -lsr
   hadoop dfs -get tmp/crawldbDump1 tmp/

15) CREATE FETCHLIST

To fetch, we first need to generate a fetchlist from the injected URLs in
the database. This generates a fetchlist for all of the pages due to be
fetched. The fetchlist is placed in a newly created segment directory; the
segment directory is named by the time it is created.

   bin/nutch generate test/crawldb test/segments

   060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101525 Generator: starting
   060511 101525 Generator: segment: test/segments/20060511101525
   060511 101525 Generator: Selecting most-linked urls due for fetch.
   060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101525 Client connection to 10.234.57.38:9010: starting
   060511 101525 Client connection to 10.234.57.38:9011: starting
   060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101527 Running job: job_0001
   060511 101528 map 0% reduce 0%
   060511 101546 map 50% reduce 0%
   060511 101556 map 75% reduce 0%
   060511 101606 map 100% reduce 0%
   060511 101616 map 100% reduce 75%
   060511 101626 map 100% reduce 100%
   060511 101627 Job complete: job_0001
   060511 101627 Generator: Partitioning selected urls by host, for politeness.
   060511 101627 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 101627 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101627 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101628 Running job: job_0002
   060511 101629 map 0% reduce 0%
   060511 101646 map 40% reduce 0%
   060511 101656 map 60% reduce 0%
   060511 101706 map 80% reduce 0%
   060511 101717 map 100% reduce 0%
   060511 101726 map 100% reduce 100%
   060511 101727 Job complete: job_0002
   060511 101727 Generator: done

At the end of this we will have the new fetchlist created in:

   test/segments/20060511101525/crawl_generate/part-00000   <r 2>   777933
   test/segments/20060511101525/crawl_generate/part-00001   <r 2>   751088
   test/segments/20060511101525/crawl_generate/part-00002   <r 2>   988871
   test/segments/20060511101525/crawl_generate/part-00003   <r 2>   833454

You can inspect a generated fetchlist with readseg (the segment name in
this example comes from a different run; use your own segment name):

   nutch readseg -dump test/segments/20061027135841 test/segments/20061027135841/gendump -nocontent -nofetch -noparse -noparsedata -noparsetext

16) FETCH

Now we run the fetcher on the created segment. This will load the web pages
into the segment.
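The fetch is by far the longest step (in the log below it runs for more
than three hours). The Nutch fetcher normally also accepts a -threads
option to control the number of fetcher threads; treat the option name and
the value below as assumptions and check the usage printed by
"bin/nutch fetch" with no arguments first. A hedged sketch:

   # Print the fetcher usage to confirm the supported options,
   # then fetch with an explicit number of threads (20 is only an example):
   bin/nutch fetch
   bin/nutch fetch test/segments/20060511101525 -threads 20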
   bin/nutch fetch test/segments/20060511101525

   060511 101820 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101820 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101821 Fetcher: starting
   060511 101821 Fetcher: segment: test/segments/20060511101525
   060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101821 Client connection to 10.234.57.38:9011: starting
   060511 101821 Client connection to 10.234.57.38:9010: starting
   060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101822 Running job: job_0003
   060511 101823 map 0% reduce 0%
   060511 110818 map 25% reduce 0%
   060511 112428 map 50% reduce 0%
   060511 122241 map 75% reduce 0%
   060511 133613 map 100% reduce 0%
   060511 133823 map 100% reduce 100%
   060511 133824 Job complete: job_0003
   060511 133824 Fetcher: done

17) UPDATE CRAWLDB

When the fetcher is complete, we update the database with the results of
the fetch. This will add to the database entries for all of the pages
referenced by the initial set in the DMOZ file.

   bin/nutch updatedb test/crawldb test/segments/20060511101525

   060511 134940 CrawlDb update: starting
   060511 134940 CrawlDb update: db: test/crawldb
   060511 134940 CrawlDb update: segment: test/segments/20060511101525
   060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 134940 Client connection to 10.234.57.38:9010: starting
   060511 134940 CrawlDb update: Merging segment data into db.
   060511 134940 Client connection to 10.234.57.38:9011: starting
   060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 134941 Running job: job_0004
   060511 134942 map 0% reduce 0%
   060511 134954 map 17% reduce 0%
   060511 135004 map 25% reduce 0%
   060511 135013 map 33% reduce 0%
   060511 135023 map 42% reduce 0%
   060511 135024 map 50% reduce 0%
   060511 135034 map 58% reduce 0%
   060511 135044 map 67% reduce 0%
   060511 135054 map 83% reduce 0%
   060511 135104 map 92% reduce 0%
   060511 135114 map 100% reduce 0%
   060511 135124 map 100% reduce 100%
   060511 135125 Job complete: job_0004
   060511 135125 CrawlDb update: done

A) We can now see the crawl statistics:

   bin/nutch readdb test/crawldb -stats

   060511 135340 CrawlDb statistics start: test/crawldb
   060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135340 Client connection to 10.234.57.38:9010: starting
   060511 135340 Client connection to 10.234.57.38:9011: starting
   060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135341 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135341 Running job: job_0005
   060511 135342 map 0% reduce 0%
   060511 135353 map 25% reduce 0%
   060511 135354 map 50% reduce 0%
   060511 135405 map 75% reduce 0%
   060511 135414 map 100% reduce 0%
   060511 135424 map 100% reduce 25%
   060511 135425 map 100% reduce 50%
   060511 135434 map 100% reduce 75%
   060511 135444 map 100% reduce 100%
   060511 135445 Job complete: job_0005
   060511 135445 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135445 Statistics for CrawlDb: test/crawldb
   060511 135445 TOTAL urls:   585055
   060511 135445 avg score:    1.068
   060511 135445 max score:    185.981
   060511 135445 min score:    1.0
   060511 135445 retry 0:      583943
   060511 135445 retry 1:      1112
   060511 135445 status 1 (DB_unfetched):  540202
   060511 135445 status 2 (DB_fetched):    43086
   060511 135445 status 3 (DB_gone):       1767
   060511 135445 CrawlDb statistics: done

   "I believe the retry numbers are the number of times page fetches failed
   for recoverable errors and were re-processed before the page was
   fetched. So most of the pages were fetched on the first try. Some
   encountered errors and were fetched on the next try, and so on. The
   default setting is a maximum of 3 retries, in the db.fetch.retry.max
   property."
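Besides -stats and -dump, the readdb tool can usually also print the record
stored for a single URL, which is handy for spot checks. The -url option
and the example URL are assumptions here; check the usage printed by
"bin/nutch readdb" with no arguments on your build:

   # Show the record (status, fetch time, score, ...) kept for one URL:
   bin/nutch readdb test/crawldb -url http://www.apache.org/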
B) We can now dump the crawldb to a flat file in the DFS and get a copy out
   to a local file:

   bin/nutch readdb test/crawldb -dump mydump

   060511 135603 CrawlDb dump: starting
   060511 135603 CrawlDb db: test/crawldb
   060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135603 Client connection to 10.234.57.38:9010: starting
   060511 135603 Client connection to 10.234.57.38:9011: starting
   060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135604 Running job: job_0006
   060511 135605 map 0% reduce 0%
   060511 135624 map 50% reduce 0%
   060511 135634 map 75% reduce 0%
   060511 135644 map 100% reduce 0%
   060511 135654 map 100% reduce 25%
   060511 135704 map 100% reduce 100%
   060511 135705 Job complete: job_0006
   060511 135705 CrawlDb dump: done

   bin/hadoop dfs -lsr mydump

   060511 135802 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135802 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135803 No FS indicated, using default:myhost1.mydomain.org:9010
   060511 135803 Client connection to 10.234.57.38:9010: starting
   /user/nutch/mydump/part-00000   <r 2>   39031197
   /user/nutch/mydump/part-00001   <r 2>   39186940
   /user/nutch/mydump/part-00002   <r 2>   38954809
   /user/nutch/mydump/part-00003   <r 2>   39171283

   bin/hadoop dfs -get mydump/part-00000 mydumpFile

   060511 135848 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135848 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135848 No FS indicated, using default:myhost1.mydomain.org:9010
   060511 135848 Client connection to 10.234.57.38:9010: starting

   more mydumpFile

   gopher://csf.Colorado.EDU/11/ipe/Thematic_Archive/newsletters/africa_information_afrique_net/Angola   Version: 4
   Status: 1 (DB_unfetched)
   Fetch time: Thu May 11 13:38:09 CEST 2006
   Modified time: Thu Jan 01 01:00:00 CET 1970
   Retries since fetch: 0
   Retry interval: 30.0 days
   Score: 1.0666667
   Signature: null
   Metadata: null

   gopher://gopher.gwdg.de/11/Uni/igdl   Version: 4
   Status: 1 (DB_unfetched)
   Fetch time: Thu May 11 13:37:03 CEST 2006
   Modified time: Thu Jan 01 01:00:00 CET 1970
   Retries since fetch: 0
   Retry interval: 30.0 days
   Score: 1.0140845
   Signature: null
   Metadata: null

   gopher://gopher.jer1.co.il:70/00/jorgs/npo/camera/media/1994/npr   Version: 4
   Status: 1 (DB_unfetched)
   Fetch time: Thu May 11 13:36:48 CEST 2006
   Modified time: Thu Jan 01 01:00:00 CET 1970
   Retries since fetch: 0
   Retry interval: 30.0 days
   Score: 1.0105263
   Signature: null
   Metadata: null

   ...

18) INVERT LINKS

Before indexing, we first invert all of the links, so that we may index
incoming anchor text with the pages.
We now need to generate a linkdb; this is done with all segments in your
segments folder:

   bin/nutch invertlinks linkdb test/segments/20060511101525

   060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 140228 LinkDb: starting
   060511 140228 LinkDb: linkdb: linkdb
   060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 140228 Client connection to 10.234.57.38:9010: starting
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 140228 LinkDb: adding segment: test/segments/20060511101525
   060511 140228 Client connection to 10.234.57.38:9011: starting
   060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 140229 Running job: job_0007
   060511 140230 map 0% reduce 0%
   060511 140255 map 50% reduce 0%
   060511 140305 map 75% reduce 0%
   060511 140314 map 100% reduce 0%
   060511 140324 map 100% reduce 100%
   060511 140325 Job complete: job_0007
   060511 140325 LinkDb: done

23) INDEX SEGMENT

To index the segment we use the index command, as follows.
   bin/nutch index test/indexes test/crawldb linkdb test/segments/20060511101525

   060515 134738 Indexer: starting
   060515 134738 Indexer: linkdb: linkdb
   060515 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/hadoop-default.xml
   060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060515 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/mapred-default.xml
   060515 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/mapred-default.xml
   060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060515 134738 Indexer: adding segment: test/segments/20060511101525
   060515 134738 Client connection to 10.234.57.38:9010: starting
   060515 134738 Client connection to 10.234.57.38:9011: starting
   060515 134739 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/hadoop-default.xml
   060515 134739 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060515 134739 Running job: job_0006
   060515 134741 map 0% reduce 0%
   060515 134758 map 11% reduce 0%
   060515 134808 map 18% reduce 0%
   060515 134818 map 25% reduce 0%
   060515 134827 map 38% reduce 2%
   060515 134837 map 44% reduce 2%
   060515 134847 map 50% reduce 9%
   060515 134857 map 53% reduce 11%
   060515 134908 map 59% reduce 13%
   060515 134918 map 66% reduce 13%
   060515 134928 map 71% reduce 13%
   060515 134938 map 74% reduce 13%
   060515 134948 map 88% reduce 16%
   060515 134957 map 94% reduce 17%
   060515 135007 map 100% reduce 22%
   060515 135017 map 100% reduce 50%
   060515 135028 map 100% reduce 78%
   060515 135038 map 100% reduce 82%
   060515 135048 map 100% reduce 87%
   060515 135058 map 100% reduce 92%
   060515 135108 map 100% reduce 97%
   060515 135117 map 100% reduce 99%
   060515 135118 map 100% reduce 100%
   060515 135129 Job complete: job_0006
   060515 135129 Indexer: done

24) TRY SEARCHING THE ENGINE USING NUTCH ITSELF

Nutch looks for the "index" and "segments" subdirectories of the DFS in the
directory defined by the searcher.dir property. Edit conf/nutch-site.xml
and add the following lines:

   <property>
     <name>searcher.dir</name>
     <value>test</value>
     <description>
       Path to root of crawl. This directory is searched (in order)
       for either the file search-servers.txt, containing a list of
       distributed search servers, or the directory "index" containing
       merged indexes, or the directory "segments" containing segment
       indexes.
     </description>
   </property>

This is where the search looks for its data, as explained in the
description. Now run a search using Nutch itself, for example:

   /opt/nutch/bin/nutch org.apache.nutch.searcher.NutchBean developpement

26) SEARCH THE ENGINE USING THE BROWSER

To search from a browser you need to have Tomcat installed and to put the
Nutch war file into the Tomcat servlet container. I have built and
installed Tomcat as /opt/tomcat.

Note (important):
Something interesting to note about the distributed filesystem is that it
is user specific. If you store a directory "urls" in the filesystem as the
nutch user, it is actually stored as /user/nutch/urls. What this means to
us is that the user that does the crawl and stores it in the distributed
filesystem must also be the user that starts the search, or no results
will come back. You can try this yourself by logging in as a different
user and running the ls command: it won't find the directories, because it
is looking under a different directory, /user/username, instead of
/user/nutch.
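A minimal illustration of this, assuming the crawl data belongs to the
nutch user (the second user name is only an example):

   # As the nutch user, relative paths resolve under /user/nutch:
   bin/hadoop dfs -ls                 # shows dmoz, test, linkdb, ...

   # As another user (say "tomcat"), the same command looks under
   # /user/tomcat and finds nothing; the data is still reachable
   # through the absolute path:
   bin/hadoop dfs -ls /user/nutch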
As explained above, we need to run Tomcat as the nutch user in order to be
sure to get search results. Be sure to have write permission on the Nutch
logs directory and read permission on the rest of the installation.

Log in as root:

   chmod -R ugo+rx  /opt/nutch
   chmod -R ugo+rwx /opt/nutch/logs

   export CATALINA_OPTS="-server -Xss256k -Xms768m -Xmx768m -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true"

   rm -rf /opt/tomcat/webapps/ROOT*
   cp /opt/nutch/nutch*.war /opt/tomcat/webapps/ROOT.war
   /opt/tomcat/bin/startup.sh

This should create a new webapps/ROOT directory.

We now have to ensure that the webapp (Tomcat) can find the index and
segments. The Tomcat webapp will use the Nutch configuration files under
/opt/tomcat/webapps/ROOT/WEB-INF/classes, so copy your modified
configuration files there from the nutch/conf directory:

   cp /opt/nutch/conf/hadoop-site.xml /opt/tomcat/webapps/ROOT/WEB-INF/classes/hadoop-site.xml
   cp /opt/nutch/conf/hadoop-env.sh   /opt/tomcat/webapps/ROOT/WEB-INF/classes/hadoop-env.sh
   cp /opt/nutch/conf/nutch-site.xml  /opt/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml

Now restart Tomcat and enter the following URL into your browser:

   http://localhost:8080

The Nutch search page should appear.

27) RECRAWLING

Now that everything works, we update our db with new URLs.

A) We create the fetchlist with the top-100-scoring pages in the current db:

   bin/nutch generate test/crawldb test/segments -topN 100

   This has generated the new segment: test/segments/20060516135945

B) Now we fetch the new pages:

   bin/nutch fetch test/segments/20060516135945

C) The db is now updated with the entries of the new pages:

   bin/nutch updatedb test/crawldb test/segments/20060516135945

D) We now invert links. I guess I could have just inverted links on
   test/segments/20060516135945, but here I do it on all segments:

   bin/nutch invertlinks linkdb -dir test/segments

E) Remove the test/indexes directory:

   hadoop dfs -rm test/indexes

F) Now we recreate the indexes:

   nutch index test/indexes test/crawldb linkdb test/segments/20060511101525 test/segments/20060516135945

G) Dedup:

   bin/nutch dedup test/indexes

H) Merge the indexes:

   bin/nutch merge test/index test/indexes

I) Now, if you like, you can even remove test/indexes.

A consolidated sketch of this whole cycle is given at the end of this
section.

I have also tried to index each segment into a separate indexes directory,
like this (the crawldb argument is needed here too):

   nutch index test/indexes1 test/crawldb linkdb test/segments/20060511101525
   nutch index test/indexes2 test/crawldb linkdb test/segments/20060516135945
   bin/nutch merge test/index test/indexes1 test/indexes2

It looks like this works, and it avoids re-indexing every segment each
time: we only index the new segment and just have to regenerate the merged
index.

Another solution for merging could have been to index each segment into a
different index directory:

   nutch index indexe1 test/crawldb linkdb test/segments/20060511101525
   nutch index indexe2 test/crawldb linkdb test/segments/20060516135945
   nutch merge test/index test/indexe1 test/indexe2

Yet another solution is to merge the segments and index only the resulting
merged segment, but so far I have not succeeded in doing so.
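For convenience, the recrawl steps A) through H) can be strung together in
a small shell script. This is only a sketch of the commands already shown
above; the segment name is filled in by hand from the output of the
generate step (there is no automatic detection here):

   #!/bin/sh
   # One recrawl cycle, mirroring steps A) to H) above.
   # Run from /opt/nutch as the nutch user on the master.
   cd /opt/nutch

   # A) generate a fetchlist with the top-100-scoring pages
   bin/nutch generate test/crawldb test/segments -topN 100

   # Set this to the segment name printed by the generate step:
   NEW_SEGMENT=test/segments/20060516135945

   # B) fetch the new pages
   bin/nutch fetch $NEW_SEGMENT

   # C) update the crawldb with the fetch results
   bin/nutch updatedb test/crawldb $NEW_SEGMENT

   # D) rebuild the linkdb from all segments
   bin/nutch invertlinks linkdb -dir test/segments

   # E) remove the old indexes, then F) re-index all segments
   bin/hadoop dfs -rm test/indexes
   bin/nutch index test/indexes test/crawldb linkdb test/segments/20060511101525 $NEW_SEGMENT

   # G) dedup and H) merge into the final index
   bin/nutch dedup test/indexes
   bin/nutch merge test/index test/indexes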
> > > > # > # > #nutch crawl dmoz/urls -dir crawl-tinysite -depth 10 > ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ Nutch-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/nutch-general
