Hi,

Thanks for your reply. In my crawl-urlfilter.txt I included the following line, since I want to crawl Wikipedia:

+^http://([a-z0-9]*\.)*wikipedia.org/

My urls/urllist.txt contains Wikipedia URLs such as:

http://en.wikipedia.org/
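Apart from that one line, my crawl-urlfilter.txt is essentially the stock file that ships with Nutch 0.9. Reproducing the default rules roughly from memory (so the exact patterns may differ slightly from my copy), the whole file looks something like this, with the Wikipedia rule placed before the final catch-all exclude:

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):

  # skip image and other suffixes we can't yet parse
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

  # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
  -.*(/[^/]+)/[^/]+\1/[^/]+\1/

  # accept hosts in wikipedia.org (the line quoted above)
  +^http://([a-z0-9]*\.)*wikipedia.org/

  # skip everything else
  -.

As far as I understand, the filter applies the first rule that matches a URL, so the +wikipedia.org line has to appear before the final "-." catch-all for the seed URLs to be accepted.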
I used Nutch 0.9 previously on Fedora 8 and it worked fine there. So please tell me if you have any idea what is going wrong.

Best regards,
--monirul

----- Original Message ----
From: Alexander Aristov <[EMAIL PROTECTED]>
To: [email protected]
Sent: Monday, August 4, 2008 1:28:58 PM
Subject: Re: problem in crawling

Hi,

What is in your crawl-urlfilter.txt file? Did you include your URLs in the filter? By default all URLs are excluded.

Alexander

2008/8/3 Mohammad Monirul Hoque <[EMAIL PROTECTED]>

> Hi,
>
> I am using Nutch 0.9 on Ubuntu on a single machine in pseudo-distributed
> mode. When I execute the following command:
>
> bin/nutch crawl urls -dir crawled -depth 10
>
> this is what I get in the Hadoop log:
>
> 2008-08-03 03:10:17,392 INFO crawl.Crawl - crawl started in: crawled
> 2008-08-03 03:10:17,392 INFO crawl.Crawl - rootUrlDir = urls
> 2008-08-03 03:10:17,392 INFO crawl.Crawl - threads = 10
> 2008-08-03 03:10:17,392 INFO crawl.Crawl - depth = 10
> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: starting
> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: crawlDb: crawled/crawldb
> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: urlDir: urls
> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2008-08-03 03:10:35,227 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> 2008-08-03 03:10:59,724 INFO crawl.Injector - Injector: done
> 2008-08-03 03:11:00,791 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: starting
> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031100
> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: filtering: false
> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: topN: 2147483647
> 2008-08-03 03:11:24,239 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
> 2008-08-03 03:11:47,583 INFO crawl.Generator - Generator: done.
> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: starting
> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031100
> 2008-08-03 03:12:36,915 INFO fetcher.Fetcher - Fetcher: done
> 2008-08-03 03:12:36,951 INFO crawl.CrawlDb - CrawlDb update: starting
> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031100]
> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> 2008-08-03 03:12:36,967 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> 2008-08-03 03:13:20,341 INFO crawl.CrawlDb - CrawlDb update: done
> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: starting
> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031321
> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: filtering: false
> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: topN: 2147483647
> 2008-08-03 03:13:39,667 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
> 2008-08-03 03:14:04,963 INFO crawl.Generator - Generator: done.
> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: starting
> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031321
> 2008-08-03 03:21:26,809 INFO fetcher.Fetcher - Fetcher: done
> 2008-08-03 03:21:26,851 INFO crawl.CrawlDb - CrawlDb update: starting
> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031321]
> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> 2008-08-03 03:21:26,866 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> 2008-08-03 03:22:13,223 INFO crawl.CrawlDb - CrawlDb update: done
> 2008-08-03 03:22:14,251 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: starting
> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: segment: crawled/segments/20080803032214
> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: filtering: false
> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: topN: 2147483647
> 2008-08-03 03:22:34,459 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
> 2008-08-03 03:22:59,733 INFO crawl.Generator - Generator: done.
> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: starting
> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803032214
> 2008-08-03 04:24:53,193 INFO fetcher.Fetcher - Fetcher: done
>
> This is what I found when executing the following commands:
>
> $ bin/hadoop dfs -ls
> Found 2 items
> /user/nutch/crawled <dir>
> /user/nutch/urls <dir>
>
> $ bin/hadoop dfs -ls crawled
> Found 2 items
> /user/nutch/crawled/crawldb <dir>
> /user/nutch/crawled/segments <dir>
>
> Where are the linkdb, indexes, and index directories? Please tell me what the error might be.
>
> Here is my hadoop-site.xml:
>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>sysmonitor:9000</value>
>     <description>
>       The name of the default file system. Either the literal string
>       "local" or a host:port for NDFS.
>     </description>
>   </property>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>sysmonitor:9001</value>
>     <description>
>       The host and port that the MapReduce job tracker runs at. If
>       "local", then jobs are run in-process as a single map and
>       reduce task.
>     </description>
>   </property>
>   <property>
>     <name>mapred.tasktracker.tasks.maximum</name>
>     <value>2</value>
>     <description>
>       The maximum number of tasks that will be run simultaneously by
>       a task tracker.
>       This should be adjusted according to the heap size per task,
>       the amount of RAM available, and CPU consumption of each task.
>     </description>
>   </property>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx200m</value>
>     <description>
>       You can specify other Java options for each map or reduce task here,
>       but most likely you will want to adjust the heap size.
>     </description>
>   </property>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>/nutch/filesystem/name</value>
>   </property>
>   <property>
>     <name>dfs.data.dir</name>
>     <value>/nutch/filesystem/data</value>
>   </property>
>   <property>
>     <name>mapred.system.dir</name>
>     <value>/nutch/filesystem/mapreduce/system</value>
>   </property>
>   <property>
>     <name>mapred.local.dir</name>
>     <value>/nutch/filesystem/mapreduce/local</value>
>   </property>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
> </configuration>
>
> My urls/urllist.txt contains almost 100 seed URLs and the depth is 10, but it
> seems very little crawling was done.
>
> regards
> --monirul

--
Best Regards
Alexander Aristov
