Great! Thank you Corrado!

----- Original Message -----
From: "zzcgiacomini" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, April 04, 2007 10:53 PM
Subject: Nutch Step by Step

Maybe someone will find this useful?
I have spent some time playing with Nutch 0.8 and collecting notes from the
mailing lists... Maybe someone will find these notes useful and can point
out my mistakes; I am not at all a Nutch expert.
-Corrado

--------------------------------------------------------------------------------
0) CREATE NUTCH USER AND GROUP

Create a nutch user and group and perform all the following steps logged in
as the nutch user. Put these lines in your .bash_profile:

   export JAVA_HOME=/opt/jdk
   export PATH=$JAVA_HOME/bin:$PATH

1) GET HADOOP AND NUTCH

Download the Nutch and Hadoop trunks as explained on
http://lucene.apache.org/hadoop/version_control.html

   svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk
   svn checkout http://svn.apache.org/repos/asf/lucene/hadoop/trunk

2) BUILD HADOOP

Build and produce the tar file:

   cd hadoop/trunk
   ant tar

To build Hadoop with 64-bit native libraries, proceed as follows:

A) Download and install the latest LZO library
   (http://www.oberhumer.com/opensource/lzo/download/)
   Note: the packages currently available for FC5 are too old.

   tar xvzf lzo-2.02.tar.gz
   cd lzo-2.02
   ./configure --prefix=/opt/lzo-2.02
   make install

B) Compile the native 64-bit libs for Hadoop, if needed:

   cd hadoop/trunk/src/native

   export LDFLAGS=-L/opt/jdk/jre/lib/amd64/server
   export JVM_DATA_MODEL=64

   CCFLAGS="-I/opt/lzo-2.02/include" CPPFLAGS="-I/opt/lzo-2.02/include" ./configure

   cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/
   cp src/org_apache_hadoop.h ./src/org/apache/hadoop/io/compress/lzo
   cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h
   cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h

   In config.h replace the line

      #define HADOOP_LZO_LIBRARY libnotfound.so

   with this one

      #define HADOOP_LZO_LIBRARY "liblzo2.so"

   make

3) BUILD NUTCH

The Nutch nightly trunk now comes with hadoop-0.12.jar, but you may want to
replace it with the latest nightly Hadoop jar you just built:

   mv nutch/trunk/lib/hadoop-0.12.jar nutch/trunk/lib/hadoop-0.12.jar.ori
   cp hadoop/trunk/build/hadoop-0.12.3-dev.jar nutch/trunk/lib/hadoop-0.12.jar
   cd nutch/trunk
   ant tar

4) INSTALL

Copy and untar the generated .tar.gz file on the machines that will
participate in the engine activities. In my case I only have two identical
machines available, called myhost1 and myhost2.

On each of them I have installed the Nutch binaries under /opt/nutch, while
I have decided to keep the Hadoop distributed filesystem in a directory
called hadoopFs, located on a large disk mounted on /disk10.

On both machines create the directory:

   mkdir /disk10/hadoopFs/

Copy the Hadoop 64-bit native libraries, if needed:

   mkdir /opt/nutch/lib/native/Linux-x86_64
   cp -fl hadoop/trunk/src/native/lib/.libs/* /opt/nutch/lib/native/Linux-x86_64

5) CONFIG

I will use myhost1 as the master machine, running the namenode and
jobtracker daemons; it will also run a datanode and a tasktracker.
myhost2 will only run a datanode and a tasktracker.

A) On both machines change the conf/hadoop-site.xml configuration file
   (a sketch of the file is shown just below; the full list of values I
   used follows it).
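A minimal sketch of what conf/hadoop-site.xml might look like, assuming the
hostnames and directory layout used in this setup. Only a few properties
are spelled out; the remaining ones from the list below are added in
exactly the same way:

# Hedged sketch: write conf/hadoop-site.xml on both machines.
cat > /opt/nutch/conf/hadoop-site.xml <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
  <name>fs.default.name</name>
  <value>myhost1.mydomain.org:9010</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>myhost1.mydomain.org:9011</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

<!-- add mapred.map.tasks, mapred.reduce.tasks, dfs.name.dir,
     dfs.data.dir, mapred.system.dir and mapred.local.dir here,
     with the values listed below -->

</configuration>
EOF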
Here are the values I have used:

   fs.default.name     : myhost1.mydomain.org:9010
   mapred.job.tracker  : myhost1.mydomain.org:9011
   mapred.map.tasks    : 40
   mapred.reduce.tasks : 3
   dfs.name.dir        : /opt/hadoopFs/name
   dfs.data.dir        : /opt/hadoopFs/data
   mapred.system.dir   : /opt/hadoopFs/mapreduce/system
   mapred.local.dir    : /opt/hadoopFs/mapreduce/local
   dfs.replication     : 2

   (These directories should correspond to the hadoopFs directory created
   in step 4; make sure the paths are consistent on both machines, for
   example via a symlink.)

   "The mapred.map.tasks property tells how many tasks you want to run in
   parallel. This should be a multiple of the number of computers that you
   have. In our case, since we are starting out with 2 computers, we will
   have 4 map and 4 reduce tasks."
   (Note that above I actually used 40 map tasks and 3 reduce tasks.)

   "The dfs.replication property states how many servers a single file
   should be replicated to before it becomes available. Because we are
   using 2 servers I have set this to 2."

   You may also want to change nutch-site.xml, setting http.redirect.max
   to a value different from the default of 3:

   http.redirect.max : 10

B) Be sure that your conf/slaves file contains the names of the slave
   machines. In my case:

   myhost1.mydomain.org
   myhost2.mydomain.org

C) Create directories for pid and log files on both machines:

   mkdir /opt/nutch/pids
   mkdir /opt/nutch/logs

D) On both machines change the conf/hadoop-env.sh file to point to the
   right Java and Nutch installation:

   export HADOOP_HOME=/opt/nutch
   export JAVA_HOME=/opt/jdk
   export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
   export HADOOP_PID_DIR=${HADOOP_HOME}/pids

E) Because of a classloader problem in Nutch, the following lines need to
   be added to the nutch/bin/hadoop script before it starts building the
   CLASSPATH variable:

   for f in $HADOOP_HOME/nutch-*.jar; do
     CLASSPATH=${CLASSPATH}:$f;
   done

   This will put the nutch-*.jar file into the CLASSPATH.

6) SSH SETUP (Important!)

Set up ssh as explained in http://wiki.apache.org/nutch/NutchHadoopTutorial
and test the ability to log in without a password on each host itself and
from myhost1 to myhost2 and vice versa. This is a very important step to
avoid connection-refused problems between the daemons.

Here is a short example of how to proceed:

A) Use ssh-keygen to create the .ssh/id_dsa files:

   ssh-keygen -t dsa
   Generating public/private dsa key pair.
   Enter file in which to save the key (/home/nutch/.ssh/id_dsa):
   Enter passphrase (empty for no passphrase):
   Enter same passphrase again:
   Your identification has been saved in /home/nutch/.ssh/id_dsa.
   Your public key has been saved in /home/nutch/.ssh/id_dsa.pub.
   The key fingerprint is:
   01:36:6c:9d:27:09:54:e4:ff:fb:20:86:8c:e1:6c:82 [EMAIL PROTECTED]

B) Copy .ssh/id_dsa.pub to all machines as .ssh/authorized_keys.

C) On each machine configure ssh-agent to start at login, either by adding
   a line to .xsession (e.g. "ssh-agent startkde") or by putting
   eval `ssh-agent` in .bashrc (this will start an ssh-agent for every new
   shell).

D) Use ssh-add to add the DSA key:

   ssh-add
   Enter passphrase for /home/nutch/.ssh/id_dsa:
   Identity added: /home/nutch/.ssh/id_dsa (/home/nutch/.ssh/id_dsa)
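To verify the ssh setup before going on, a minimal check (assuming the
hostnames used in this setup) is to make sure each of these commands
returns immediately without asking for a password or passphrase:

   # run as the nutch user on myhost1
   ssh myhost1.mydomain.org hostname
   ssh myhost2.mydomain.org hostname

   # and the same from myhost2
   ssh myhost1.mydomain.org hostname
   ssh myhost2.mydomain.org hostname

If any of them prompts for a password, the daemons started through the
start-all.sh/stop-all.sh scripts later on will have problems reaching the
other node.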
7) FORMAT HADOOP FILESYSTEM

"Fix for HADOOP-19. A namenode must now be formatted before it may be used.
Attempts to start a namenode in an unformatted directory will fail, rather
than automatically creating a new, empty filesystem, causing existing
datanodes to delete all blocks. Thus a mis-configured dfs.data.dir should
no longer cause data loss."

On the master machine (myhost1) run these commands:

   cd /opt/nutch/
   bin/hadoop namenode -format

This will create the /opt/hadoopFs/name/image directory.

8) START NAMENODE

Start the namenode on the master machine (myhost1):

   bin/hadoop-daemon.sh start namenode

   starting namenode, logging to /opt/nutch/logs/hadoop-nutch-namenode-myhost1.mydomain.org.out
   060509 150431 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 150431 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 150431 directing logs to directory /opt/nutch/logs

9) START DATANODES

Start the datanode on the master and all slave machines (myhost1 and
myhost2).

On myhost1:

   bin/hadoop-daemon.sh start datanode

   starting datanode, logging to /opt/nutch/logs/hadoop-nutch-datanode-myhost1.mydomain.org.out
   060509 150619 0x0000000a parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 150619 0x0000000a parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 150619 0x0000000a directing logs to directory /opt/nutch/logs

On myhost2:

   bin/hadoop-daemon.sh start datanode

   starting datanode, logging to /opt/nutch/logs/hadoop-nutch-datanode-myhost2.mydomain.org.out
   060509 151517 0x0000000a parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 151517 0x0000000a parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 151517 0x0000000a directing logs to directory /opt/nutch/logs

10) START JOBTRACKER

Start the jobtracker on the master machine (myhost1).

On myhost1:

   bin/hadoop-daemon.sh start jobtracker

   starting jobtracker, logging to /opt/nutch/logs/hadoop-nutch-jobtracker-myhost1.mydomain.org.out
   060509 152020 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 152021 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 152021 directing logs to directory /opt/nutch/logs

11) START TASKTRACKERS

Start the tasktracker on the slave machines (myhost1 and myhost2).

On myhost1:

   bin/hadoop-daemon.sh start tasktracker

   starting tasktracker, logging to /opt/nutch/logs/hadoop-nutch-tasktracker-myhost1.mydomain.org.out
   060509 152236 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 152236 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060509 152236 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 152236 directing logs to directory /opt/nutch/logs

On myhost2:

   bin/hadoop-daemon.sh start tasktracker

   starting tasktracker, logging to /opt/nutch/logs/hadoop-nutch-tasktracker-myhost2.mydomain.org.out
   060509 152333 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060509 152333 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060509 152333 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060509 152333 directing logs to directory /opt/nutch/logs

NOTE: Now that we have verified that the daemons start and connect
properly, we can start and stop all of them with the start-all.sh and
stop-all.sh scripts from the master machine, as shown below.
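A minimal sketch, assuming the standard Hadoop helper scripts shipped in
/opt/nutch/bin and the conf/slaves file set up in step 5B:

   # Run as the nutch user on the master (myhost1).
   cd /opt/nutch
   bin/stop-all.sh    # stops tasktrackers, datanodes, jobtracker, namenode on all hosts
   bin/start-all.sh   # starts them all again, using conf/slaves to reach the slaves over ssh

The per-daemon hadoop-daemon.sh commands above remain useful when you want
to restart a single daemon on a single machine.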
12) TEST FUNCTIONALITY

Test the Hadoop functionality... just a simple ls:

   bin/hadoop dfs -ls

   060509 152844 parsing jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
   060509 152845 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
   060509 152845 No FS indicated, using default:localhost:9010
   060509 152845 Client connection to 127.0.0.1:9010: starting
   Found 0 items

The DFS filesystem is empty... of course.

13) CREATE FILE FOR URL INJECTION

Now we need to create a crawldb and inject URLs into it. These initial URLs
will then be used for the initial crawling. Let's inject URLs from the DMOZ
Open Directory. First we must download and uncompress the file listing all
of the DMOZ pages. (This is an approximately 300MB compressed file, about
2GB uncompressed, so this will take a few minutes.)

On the myhost1 machine, where we run the namenode:

   cd /disk10
   wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
   gunzip content.rdf.u8.gz
   mkdir dmoz

A) All 5 million pages. DMOZ contains around 5 million URLs:

   /opt/nutch-0.8-dev/bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls

   060510 104615 parsing jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-default.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-site.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
   060510 104615 skew = -2131431075
   060510 104615 Begin parse
   060510 104616 Client connection to myhost1:9010: starting
   060510 105156 Completed parse. Found 4756391 pages.

B) As a second choice, we can select a random subset of these pages. (We
   use a random subset so that everyone who runs this tutorial doesn't
   hammer the same sites.) DMOZ contains around five million URLs; with
   -subset 100 we select one out of every 100, so that we end up with
   around 50,000 URLs:

   bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 100 > dmoz/urls

   060510 104615 parsing jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-default.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-site.xml
   060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
   060510 104615 skew = -736060357
   060510 104615 Begin parse
   060510 104615 Client connection to myhost1:9010: starting
   060510 104615 Completed parse. Found 49498 pages.

Here I go with choice B.

The parser also takes a few minutes, as it must parse the full 2GB file.
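Before copying the list into the DFS, it can be worth a quick sanity check
on the local output file; a minimal sketch using plain coreutils (nothing
Nutch-specific), assuming the paths above:

   wc -l /disk10/dmoz/urls     # should roughly match the "Found ... pages" count
   head -5 /disk10/dmoz/urls   # eyeball a few of the extracted URLs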
Finally, we initialize the crawl db with the selected URLs. First copy the
dmoz directory into the DFS:

   bin/hadoop dfs -put /disk10/dmoz dmoz

   060510 101321 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060510 101321 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060510 101321 No FS indicated, using default:myhost1.mydomain.org:9010
   060510 101321 Client connection to 10.234.57.38:9010: starting
   060510 101321 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060510 101321 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml

   bin/hadoop dfs -lsr dmoz

   060510 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060510 134738 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060510 134738 No FS indicated, using default:myhost1.mydomain.org:9010
   060510 134738 Client connection to 10.234.57.38:9010: starting
   /user/nutch/dmoz        <dir>
   /user/nutch/dmoz/urls   <r 2>   57059180

14) CREATE CRAWLDB (INJECT URLs)

Create a crawldb and inject the URLs into the web database:

   bin/nutch inject test/crawldb dmoz

   060511 092330 Injector: starting
   060511 092330 Injector: crawlDb: test/crawldb
   060511 092330 Injector: urlDir: dmoz
   060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 092330 Injector: Converting injected urls to crawl db entries.
   060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 092330 Client connection to 10.234.57.38:9010: starting
   060511 092330 Client connection to 10.234.57.38:9011: starting
   060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 092332 Running job: job_0001
   060511 092333 map 0% reduce 0%
   060511 092342 map 25% reduce 0%
   060511 092344 map 50% reduce 0%
   060511 092354 map 75% reduce 0%
   060511 092402 map 100% reduce 0%
   060511 092412 map 100% reduce 25%
   060511 092414 map 100% reduce 75%
   060511 092422 map 100% reduce 100%
   060511 092423 Job complete: job_0001
   060511 092423 Injector: Merging injected urls into crawl db.
   060511 092423 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 092423 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 092423 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 092424 Running job: job_0002
   060511 092425 map 0% reduce 0%
   060511 092442 map 25% reduce 0%
   060511 092444 map 50% reduce 0%
   060511 092454 map 75% reduce 0%
   060511 092502 map 100% reduce 0%
   060511 092511 map 100% reduce 25%
   060511 092513 map 100% reduce 75%
   060511 092522 map 100% reduce 100%
   060511 092523 Job complete: job_0002
   060511 092523 Injector: done

This will create the test/crawldb folder in the DFS.

From the Nutch tutorial:
   "The crawl database, or crawldb. This contains information about every
   url known to Nutch, including whether it was fetched, and, if so, when."

You can also see that the physical filesystem where we put the DFS has
changed: a few data block files have been created. This happens on both
myhost1 and myhost2, which participate in the DFS.

   tree /disk10/hadoopFs

   /disk10/hadoopFs
   |-- data
   |   |-- data
   |   |   |-- blk_-1388015236827939264
   |   |   |-- blk_-2961663541591843930
   |   |   |-- blk_-3901036791232325566
   |   |   |-- blk_-5212946459038293740
   |   |   |-- blk_-5301517582607663382
   |   |   |-- blk_-7397383874477738842
   |   |   |-- blk_-9055045635688102499
   |   |   |-- blk_-9056717903919576858
   |   |   |-- blk_1330666339588899715
   |   |   |-- blk_1868647544763144796
   |   |   |-- blk_3136516483028291673
   |   |   |-- blk_4297959992285923734
   |   |   |-- blk_5111098874834542511
   |   |   |-- blk_5224195282207865093
   |   |   |-- blk_5554003155307698150
   |   |   |-- blk_7122181909600991812
   |   |   |-- blk_8745902888438265091
   |   |   `-- blk_883778723937265061
   |   `-- tmp
   |-- mapreduce
   `-- name
       |-- edits
       `-- image
           `-- fsimage

To inspect the crawldb you can also dump it inside the DFS and copy the
dump out to the local filesystem:

   nutch readdb test/crawldb -dump tmp/crawldbDump1
   hadoop dfs -lsr
   hadoop dfs -get tmp/crawldbDump1 tmp/

15) CREATE FETCHLIST

To fetch, we first need to generate a fetchlist from the injected URLs in
the database. This generates a fetchlist for all of the pages due to be
fetched. The fetchlist is placed in a newly created segment directory; the
segment directory is named by the time it is created.

   bin/nutch generate test/crawldb test/segments

   060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101525 Generator: starting
   060511 101525 Generator: segment: test/segments/20060511101525
   060511 101525 Generator: Selecting most-linked urls due for fetch.
   060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101525 Client connection to 10.234.57.38:9010: starting
   060511 101525 Client connection to 10.234.57.38:9011: starting
   060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101527 Running job: job_0001
   060511 101528 map 0% reduce 0%
   060511 101546 map 50% reduce 0%
   060511 101556 map 75% reduce 0%
   060511 101606 map 100% reduce 0%
   060511 101616 map 100% reduce 75%
   060511 101626 map 100% reduce 100%
   060511 101627 Job complete: job_0001
   060511 101627 Generator: Partitioning selected urls by host, for politeness.
   060511 101627 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 101627 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101627 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101628 Running job: job_0002
   060511 101629 map 0% reduce 0%
   060511 101646 map 40% reduce 0%
   060511 101656 map 60% reduce 0%
   060511 101706 map 80% reduce 0%
   060511 101717 map 100% reduce 0%
   060511 101726 map 100% reduce 100%
   060511 101727 Job complete: job_0002
   060511 101727 Generator: done

At the end of this we will have the new fetchlist created in:

   test/segments/20060511101525/crawl_generate/part-00000   <r 2>   777933
   test/segments/20060511101525/crawl_generate/part-00001   <r 2>   751088
   test/segments/20060511101525/crawl_generate/part-00002   <r 2>   988871
   test/segments/20060511101525/crawl_generate/part-00003   <r 2>   833454

You can inspect a generated fetchlist with readseg (the segment name in
this example comes from a different run; use your own segment name):

   nutch readseg -dump test/segments/20061027135841 test/segments/20061027135841/gendump -nocontent -nofetch -noparse -noparsedata -noparsetext

16) FETCH

Now we run the fetcher on the created segment. This will load the web pages
into the segment.
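The fetch is by far the longest step (in the log below it runs for more
than three hours). The Nutch fetcher normally also accepts a -threads
option to control the number of fetcher threads; treat the option name and
the value below as assumptions and check the usage printed by
"bin/nutch fetch" with no arguments first. A hedged sketch:

   # Print the fetcher usage to confirm the supported options,
   # then fetch with an explicit number of threads (20 is only an example):
   bin/nutch fetch
   bin/nutch fetch test/segments/20060511101525 -threads 20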
   bin/nutch fetch test/segments/20060511101525

   060511 101820 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101820 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101821 Fetcher: starting
   060511 101821 Fetcher: segment: test/segments/20060511101525
   060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101821 Client connection to 10.234.57.38:9011: starting
   060511 101821 Client connection to 10.234.57.38:9010: starting
   060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 101822 Running job: job_0003
   060511 101823 map 0% reduce 0%
   060511 110818 map 25% reduce 0%
   060511 112428 map 50% reduce 0%
   060511 122241 map 75% reduce 0%
   060511 133613 map 100% reduce 0%
   060511 133823 map 100% reduce 100%
   060511 133824 Job complete: job_0003
   060511 133824 Fetcher: done

17) UPDATE CRAWLDB

When the fetcher is complete, we update the database with the results of
the fetch. This will add to the database entries for all of the pages
referenced by the initial set in the DMOZ file.

   bin/nutch updatedb test/crawldb test/segments/20060511101525

   060511 134940 CrawlDb update: starting
   060511 134940 CrawlDb update: db: test/crawldb
   060511 134940 CrawlDb update: segment: test/segments/20060511101525
   060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 134940 Client connection to 10.234.57.38:9010: starting
   060511 134940 CrawlDb update: Merging segment data into db.
   060511 134940 Client connection to 10.234.57.38:9011: starting
   060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 134941 Running job: job_0004
   060511 134942 map 0% reduce 0%
   060511 134954 map 17% reduce 0%
   060511 135004 map 25% reduce 0%
   060511 135013 map 33% reduce 0%
   060511 135023 map 42% reduce 0%
   060511 135024 map 50% reduce 0%
   060511 135034 map 58% reduce 0%
   060511 135044 map 67% reduce 0%
   060511 135054 map 83% reduce 0%
   060511 135104 map 92% reduce 0%
   060511 135114 map 100% reduce 0%
   060511 135124 map 100% reduce 100%
   060511 135125 Job complete: job_0004
   060511 135125 CrawlDb update: done

A) We can now see the crawl statistics:

   bin/nutch readdb test/crawldb -stats

   060511 135340 CrawlDb statistics start: test/crawldb
   060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135340 Client connection to 10.234.57.38:9010: starting
   060511 135340 Client connection to 10.234.57.38:9011: starting
   060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135341 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135341 Running job: job_0005
   060511 135342 map 0% reduce 0%
   060511 135353 map 25% reduce 0%
   060511 135354 map 50% reduce 0%
   060511 135405 map 75% reduce 0%
   060511 135414 map 100% reduce 0%
   060511 135424 map 100% reduce 25%
   060511 135425 map 100% reduce 50%
   060511 135434 map 100% reduce 75%
   060511 135444 map 100% reduce 100%
   060511 135445 Job complete: job_0005
   060511 135445 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135445 Statistics for CrawlDb: test/crawldb
   060511 135445 TOTAL urls:   585055
   060511 135445 avg score:    1.068
   060511 135445 max score:    185.981
   060511 135445 min score:    1.0
   060511 135445 retry 0:      583943
   060511 135445 retry 1:      1112
   060511 135445 status 1 (DB_unfetched):  540202
   060511 135445 status 2 (DB_fetched):    43086
   060511 135445 status 3 (DB_gone):       1767
   060511 135445 CrawlDb statistics: done

   "I believe the retry numbers are the number of times page fetches failed
   for recoverable errors and were re-processed before the page was
   fetched. So most of the pages were fetched on the first try. Some
   encountered errors and were fetched on the next try, and so on. The
   default setting is a maximum of 3 retries, in the db.fetch.retry.max
   property."
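Besides -stats and -dump, the readdb tool can usually also print the record
stored for a single URL, which is handy for spot checks. The -url option
and the example URL are assumptions here; check the usage printed by
"bin/nutch readdb" with no arguments on your build:

   # Show the record (status, fetch time, score, ...) kept for one URL:
   bin/nutch readdb test/crawldb -url http://www.apache.org/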
B) We can now dump the crawldb to a flat file in the DFS and get a copy out
   to a local file:

   bin/nutch readdb test/crawldb -dump mydump

   060511 135603 CrawlDb dump: starting
   060511 135603 CrawlDb db: test/crawldb
   060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135603 Client connection to 10.234.57.38:9010: starting
   060511 135603 Client connection to 10.234.57.38:9011: starting
   060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135604 Running job: job_0006
   060511 135605 map 0% reduce 0%
   060511 135624 map 50% reduce 0%
   060511 135634 map 75% reduce 0%
   060511 135644 map 100% reduce 0%
   060511 135654 map 100% reduce 25%
   060511 135704 map 100% reduce 100%
   060511 135705 Job complete: job_0006
   060511 135705 CrawlDb dump: done

   bin/hadoop dfs -lsr mydump

   060511 135802 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135802 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135803 No FS indicated, using default:myhost1.mydomain.org:9010
   060511 135803 Client connection to 10.234.57.38:9010: starting
   /user/nutch/mydump/part-00000   <r 2>   39031197
   /user/nutch/mydump/part-00001   <r 2>   39186940
   /user/nutch/mydump/part-00002   <r 2>   38954809
   /user/nutch/mydump/part-00003   <r 2>   39171283

   bin/hadoop dfs -get mydump/part-00000 mydumpFile

   060511 135848 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 135848 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 135848 No FS indicated, using default:myhost1.mydomain.org:9010
   060511 135848 Client connection to 10.234.57.38:9010: starting

   more mydumpFile

   gopher://csf.Colorado.EDU/11/ipe/Thematic_Archive/newsletters/africa_information_afrique_net/Angola   Version: 4
   Status: 1 (DB_unfetched)
   Fetch time: Thu May 11 13:38:09 CEST 2006
   Modified time: Thu Jan 01 01:00:00 CET 1970
   Retries since fetch: 0
   Retry interval: 30.0 days
   Score: 1.0666667
   Signature: null
   Metadata: null

   gopher://gopher.gwdg.de/11/Uni/igdl   Version: 4
   Status: 1 (DB_unfetched)
   Fetch time: Thu May 11 13:37:03 CEST 2006
   Modified time: Thu Jan 01 01:00:00 CET 1970
   Retries since fetch: 0
   Retry interval: 30.0 days
   Score: 1.0140845
   Signature: null
   Metadata: null

   gopher://gopher.jer1.co.il:70/00/jorgs/npo/camera/media/1994/npr   Version: 4
   Status: 1 (DB_unfetched)
   Fetch time: Thu May 11 13:36:48 CEST 2006
   Modified time: Thu Jan 01 01:00:00 CET 1970
   Retries since fetch: 0
   Retry interval: 30.0 days
   Score: 1.0105263
   Signature: null
   Metadata: null

   ...

18) INVERT LINKS

Before indexing, we first invert all of the links, so that we may index
incoming anchor text with the pages.
We now need to generate a linkdb; this is done with all segments in your
segments folder:

   bin/nutch invertlinks linkdb test/segments/20060511101525

   060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 140228 LinkDb: starting
   060511 140228 LinkDb: linkdb: linkdb
   060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 140228 Client connection to 10.234.57.38:9010: starting
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 140228 LinkDb: adding segment: test/segments/20060511101525
   060511 140228 Client connection to 10.234.57.38:9011: starting
   060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
   060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060511 140229 Running job: job_0007
   060511 140230 map 0% reduce 0%
   060511 140255 map 50% reduce 0%
   060511 140305 map 75% reduce 0%
   060511 140314 map 100% reduce 0%
   060511 140324 map 100% reduce 100%
   060511 140325 Job complete: job_0007
   060511 140325 LinkDb: done

23) INDEX SEGMENT

To index the segment we use the index command, as follows.
   bin/nutch index test/indexes test/crawldb linkdb test/segments/20060511101525

   060515 134738 Indexer: starting
   060515 134738 Indexer: linkdb: linkdb
   060515 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/hadoop-default.xml
   060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
   060515 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/mapred-default.xml
   060515 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/mapred-default.xml
   060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
   060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060515 134738 Indexer: adding segment: test/segments/20060511101525
   060515 134738 Client connection to 10.234.57.38:9010: starting
   060515 134738 Client connection to 10.234.57.38:9011: starting
   060515 134739 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/hadoop-default.xml
   060515 134739 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
   060515 134739 Running job: job_0006
   060515 134741 map 0% reduce 0%
   060515 134758 map 11% reduce 0%
   060515 134808 map 18% reduce 0%
   060515 134818 map 25% reduce 0%
   060515 134827 map 38% reduce 2%
   060515 134837 map 44% reduce 2%
   060515 134847 map 50% reduce 9%
   060515 134857 map 53% reduce 11%
   060515 134908 map 59% reduce 13%
   060515 134918 map 66% reduce 13%
   060515 134928 map 71% reduce 13%
   060515 134938 map 74% reduce 13%
   060515 134948 map 88% reduce 16%
   060515 134957 map 94% reduce 17%
   060515 135007 map 100% reduce 22%
   060515 135017 map 100% reduce 50%
   060515 135028 map 100% reduce 78%
   060515 135038 map 100% reduce 82%
   060515 135048 map 100% reduce 87%
   060515 135058 map 100% reduce 92%
   060515 135108 map 100% reduce 97%
   060515 135117 map 100% reduce 99%
   060515 135118 map 100% reduce 100%
   060515 135129 Job complete: job_0006
   060515 135129 Indexer: done

24) TRY SEARCHING THE ENGINE USING NUTCH ITSELF

Nutch looks for the "index" and "segments" subdirectories of the DFS in the
directory defined by the searcher.dir property. Edit conf/nutch-site.xml
and add the following lines:

   <property>
     <name>searcher.dir</name>
     <value>test</value>
     <description>
       Path to root of crawl. This directory is searched (in order)
       for either the file search-servers.txt, containing a list of
       distributed search servers, or the directory "index" containing
       merged indexes, or the directory "segments" containing segment
       indexes.
     </description>
   </property>

This is where the search looks for its data, as explained in the
description. Now run a search using Nutch itself, for example:

   /opt/nutch/bin/nutch org.apache.nutch.searcher.NutchBean developpement

26) SEARCH THE ENGINE USING THE BROWSER

To search from a browser you need to have Tomcat installed and to put the
Nutch war file into the Tomcat servlet container. I have built and
installed Tomcat as /opt/tomcat.

Note (important):
Something interesting to note about the distributed filesystem is that it
is user specific. If you store a directory "urls" in the filesystem as the
nutch user, it is actually stored as /user/nutch/urls. What this means to
us is that the user that does the crawl and stores it in the distributed
filesystem must also be the user that starts the search, or no results
will come back. You can try this yourself by logging in as a different
user and running the ls command: it won't find the directories, because it
is looking under a different directory, /user/username, instead of
/user/nutch.
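A minimal illustration of this, assuming the crawl data belongs to the
nutch user (the second user name is only an example):

   # As the nutch user, relative paths resolve under /user/nutch:
   bin/hadoop dfs -ls                 # shows dmoz, test, linkdb, ...

   # As another user (say "tomcat"), the same command looks under
   # /user/tomcat and finds nothing; the data is still reachable
   # through the absolute path:
   bin/hadoop dfs -ls /user/nutch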
As explained above, we need to run Tomcat as the nutch user in order to be
sure to get search results. Be sure to have write permission on the Nutch
logs directory and read permission on the rest of the installation.

Log in as root:

   chmod -R ugo+rx  /opt/nutch
   chmod -R ugo+rwx /opt/nutch/logs

   export CATALINA_OPTS="-server -Xss256k -Xms768m -Xmx768m -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true"

   rm -rf /opt/tomcat/webapps/ROOT*
   cp /opt/nutch/nutch*.war /opt/tomcat/webapps/ROOT.war
   /opt/tomcat/bin/startup.sh

This should create a new webapps/ROOT directory.

We now have to ensure that the webapp (Tomcat) can find the index and
segments. The Tomcat webapp will use the Nutch configuration files under
/opt/tomcat/webapps/ROOT/WEB-INF/classes, so copy your modified
configuration files there from the nutch/conf directory:

   cp /opt/nutch/conf/hadoop-site.xml /opt/tomcat/webapps/ROOT/WEB-INF/classes/hadoop-site.xml
   cp /opt/nutch/conf/hadoop-env.sh   /opt/tomcat/webapps/ROOT/WEB-INF/classes/hadoop-env.sh
   cp /opt/nutch/conf/nutch-site.xml  /opt/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml

Now restart Tomcat and enter the following URL into your browser:

   http://localhost:8080

The Nutch search page should appear.

27) RECRAWLING

Now that everything works, we update our db with new URLs.

A) We create the fetchlist with the top-100-scoring pages in the current db:

   bin/nutch generate test/crawldb test/segments -topN 100

   This has generated the new segment: test/segments/20060516135945

B) Now we fetch the new pages:

   bin/nutch fetch test/segments/20060516135945

C) The db is now updated with the entries of the new pages:

   bin/nutch updatedb test/crawldb test/segments/20060516135945

D) We now invert links. I guess I could have just inverted links on
   test/segments/20060516135945, but here I do it on all segments:

   bin/nutch invertlinks linkdb -dir test/segments

E) Remove the test/indexes directory:

   hadoop dfs -rm test/indexes

F) Now we recreate the indexes:

   nutch index test/indexes test/crawldb linkdb test/segments/20060511101525 test/segments/20060516135945

G) Dedup:

   bin/nutch dedup test/indexes

H) Merge the indexes:

   bin/nutch merge test/index test/indexes

I) Now, if you like, you can even remove test/indexes.

A consolidated sketch of this whole cycle is given at the end of this
section.

I have also tried to index each segment into a separate indexes directory,
like this (the crawldb argument is needed here too):

   nutch index test/indexes1 test/crawldb linkdb test/segments/20060511101525
   nutch index test/indexes2 test/crawldb linkdb test/segments/20060516135945
   bin/nutch merge test/index test/indexes1 test/indexes2

It looks like this works, and it avoids re-indexing every segment each
time: we only index the new segment and just have to regenerate the merged
index.

Another solution for merging could have been to index each segment into a
different index directory:

   nutch index indexe1 test/crawldb linkdb test/segments/20060511101525
   nutch index indexe2 test/crawldb linkdb test/segments/20060516135945
   nutch merge test/index test/indexe1 test/indexe2

Yet another solution is to merge the segments and index only the resulting
merged segment, but so far I have not succeeded in doing so.
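For convenience, the recrawl steps A) through H) can be strung together in
a small shell script. This is only a sketch of the commands already shown
above; the segment name is filled in by hand from the output of the
generate step (there is no automatic detection here):

   #!/bin/sh
   # One recrawl cycle, mirroring steps A) to H) above.
   # Run from /opt/nutch as the nutch user on the master.
   cd /opt/nutch

   # A) generate a fetchlist with the top-100-scoring pages
   bin/nutch generate test/crawldb test/segments -topN 100

   # Set this to the segment name printed by the generate step:
   NEW_SEGMENT=test/segments/20060516135945

   # B) fetch the new pages
   bin/nutch fetch $NEW_SEGMENT

   # C) update the crawldb with the fetch results
   bin/nutch updatedb test/crawldb $NEW_SEGMENT

   # D) rebuild the linkdb from all segments
   bin/nutch invertlinks linkdb -dir test/segments

   # E) remove the old indexes, then F) re-index all segments
   bin/hadoop dfs -rm test/indexes
   bin/nutch index test/indexes test/crawldb linkdb test/segments/20060511101525 $NEW_SEGMENT

   # G) dedup and H) merge into the final index
   bin/nutch dedup test/indexes
   bin/nutch merge test/index test/indexes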
> > > > # > # > #nutch crawl dmoz/urls -dir crawl-tinysite -depth 10 > ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ Nutch-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/nutch-general
