Hi,

The only thing I modified in crawl-urlfilter.txt was to add the line

+^http://([a-z0-9]*\.)*wikipedia.org/

and to comment out the previous line, like this:

#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

I also tried many other URLs, but each time it returned the same kind of result.

Another important thing: I am trying Nutch on Ubuntu now, which is showing the problem, but when I used it on Fedora Core 8 it worked fine. I was previously running in pseudo-distributed mode, but after running into the problem I tried stand-alone mode yesterday and it returned the same kind of result.

When I look at hadoop.log, it indicates that lots of pages were being fetched with lots of errors: a fatal error regarding http.robots.agents, "parser not found", java.net.SocketTimeoutException, and so on.

Please tell me where I am going wrong.

regards,
--monirul
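For reference, the relevant part of conf/crawl-urlfilter.txt would then look roughly like the sketch below. Only the wikipedia.org rule and the commented-out MY.DOMAIN.NAME rule come from this thread; the remaining rules are the assumed stock Nutch 0.9 defaults, included here because order matters: the regex URL filter takes the first matching rule, so an exclusion such as -[?*!@=] placed above the + rule will silently drop any URL containing '?', '=' or '&', such as the wiki.phtml?title= URLs mentioned in the quoted message below.

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in wikipedia.org (replaces the default MY.DOMAIN.NAME rule)
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://([a-z0-9]*\.)*wikipedia.org/

# skip everything else
-.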
----- Original Message ----
From: Tristan Buckner <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, August 5, 2008 12:46:21 AM
Subject: Re: problem in crawling

Are your URLs of the form http://en.wikipedia.org/w/wiki.phtml?title=_&curid=foo ? If so, the robots file excludes these. Also, is there a line above that one which the URLs might be failing on?

On Aug 4, 2008, at 11:37 AM, Mohammad Monirul Hoque wrote:

> Hi,
>
> Thanks for your reply. In my crawl-urlfilter.txt I included the following line
>
> +^http://([a-z0-9]*\.)*wikipedia.org/
>
> as I want to crawl Wikipedia.
>
> My urls/urllist.txt contains Wikipedia URLs like the one below:
>
> http://en.wikipedia.org/
>
> I used Nutch 0.9 previously on Fedora 8. It worked fine.
>
> So please tell me if you have any idea.
>
> best regards,
>
> --monirul
>
>
> ----- Original Message ----
> From: Alexander Aristov <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, August 4, 2008 1:28:58 PM
> Subject: Re: problem in crawling
>
> Hi,
>
> What is in your crawl-urlfilter.txt file?
>
> Did you include your URLs in the filter? By default all URLs are excluded.
>
> Alexander
>
> 2008/8/3 Mohammad Monirul Hoque <[EMAIL PROTECTED]>
>
>> Hi,
>>
>> I am using Nutch 0.9 on Ubuntu on a single machine in pseudo-distributed mode.
>> When I execute the following command
>>
>> bin/nutch crawl urls -dir crawled -depth 10
>>
>> this is what I got from the hadoop log:
>>
>> 2008-08-03 03:10:17,392 INFO crawl.Crawl - crawl started in: crawled
>> 2008-08-03 03:10:17,392 INFO crawl.Crawl - rootUrlDir = urls
>> 2008-08-03 03:10:17,392 INFO crawl.Crawl - threads = 10
>> 2008-08-03 03:10:17,392 INFO crawl.Crawl - depth = 10
>> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: starting
>> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: crawlDb: crawled/crawldb
>> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: urlDir: urls
>> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>> 2008-08-03 03:10:35,227 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>> 2008-08-03 03:10:59,724 INFO crawl.Injector - Injector: done
>> 2008-08-03 03:11:00,791 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: starting
>> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031100
>> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: filtering: false
>> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: topN: 2147483647
>> 2008-08-03 03:11:24,239 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> 2008-08-03 03:11:47,583 INFO crawl.Generator - Generator: done.
>> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: starting
>> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031100
>> 2008-08-03 03:12:36,915 INFO fetcher.Fetcher - Fetcher: done
>> 2008-08-03 03:12:36,951 INFO crawl.CrawlDb - CrawlDb update: starting
>> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
>> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031100]
>> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
>> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
>> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
>> 2008-08-03 03:12:36,967 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>> 2008-08-03 03:13:20,341 INFO crawl.CrawlDb - CrawlDb update: done
>> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: starting
>> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031321
>> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: filtering: false
>> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: topN: 2147483647
>> 2008-08-03 03:13:39,667 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> 2008-08-03 03:14:04,963 INFO crawl.Generator - Generator: done.
>> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: starting
>> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031321
>> 2008-08-03 03:21:26,809 INFO fetcher.Fetcher - Fetcher: done
>> 2008-08-03 03:21:26,851 INFO crawl.CrawlDb - CrawlDb update: starting
>> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
>> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031321]
>> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
>> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
>> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
>> 2008-08-03 03:21:26,866 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>> 2008-08-03 03:22:13,223 INFO crawl.CrawlDb - CrawlDb update: done
>> 2008-08-03 03:22:14,251 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: starting
>> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: segment: crawled/segments/20080803032214
>> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: filtering: false
>> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: topN: 2147483647
>> 2008-08-03 03:22:34,459 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> 2008-08-03 03:22:59,733 INFO crawl.Generator - Generator: done.
>> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: starting
>> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803032214
>> 2008-08-03 04:24:53,193 INFO fetcher.Fetcher - Fetcher: done
>>
>> This is what I found when executing the following commands:
>>
>> $ bin/hadoop dfs -ls
>> Found 2 items
>> /user/nutch/crawled <dir>
>> /user/nutch/urls <dir>
>> $ bin/hadoop dfs -ls crawled
>> Found 2 items
>> /user/nutch/crawled/crawldb <dir>
>> /user/nutch/crawled/segments <dir>
>>
>> Where are linkdb, indexes and index? So please tell me where the error may be.
>>
>> Here is my hadoop-site.xml:
>>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>sysmonitor:9000</value>
>>     <description>
>>       The name of the default file system. Either the literal string
>>       "local" or a host:port for NDFS.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>sysmonitor:9001</value>
>>     <description>
>>       The host and port that the MapReduce job tracker runs at. If
>>       "local", then jobs are run in-process as a single map and
>>       reduce task.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.tasktracker.tasks.maximum</name>
>>     <value>2</value>
>>     <description>
>>       The maximum number of tasks that will be run simultaneously by
>>       a task tracker. This should be adjusted according to the heap size
>>       per task, the amount of RAM available, and CPU consumption of
>>       each task.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.child.java.opts</name>
>>     <value>-Xmx200m</value>
>>     <description>
>>       You can specify other Java options for each map or reduce task
>>       here, but most likely you will want to adjust the heap size.
>>     </description>
>>   </property>
>>   <property>
>>     <name>dfs.name.dir</name>
>>     <value>/nutch/filesystem/name</value>
>>   </property>
>>   <property>
>>     <name>dfs.data.dir</name>
>>     <value>/nutch/filesystem/data</value>
>>   </property>
>>   <property>
>>     <name>mapred.system.dir</name>
>>     <value>/nutch/filesystem/mapreduce/system</value>
>>   </property>
>>   <property>
>>     <name>mapred.local.dir</name>
>>     <value>/nutch/filesystem/mapreduce/local</value>
>>   </property>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>1</value>
>>   </property>
>> </configuration>
>>
>> My urls/urllist.txt contains almost 100 seed URLs and the depth is 10, but it seems very little crawling was done.
>>
>> regards,
>> --monirul
>
> --
> Best Regards
> Alexander Aristov
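A note on the fatal error regarding http.robots.agents and the "parser not found" messages reported at the top of the thread: Nutch 0.9 expects the crawler's agent name to be configured, and leaving http.agent.name empty is a commonly reported cause of robots/agent errors while fetching. A minimal conf/nutch-site.xml override along these lines is sketched below; the agent name "myNutchCrawler" is a placeholder, and the plugin.includes value is the assumed stock default, shown with parse-(text|html|js) enabled so fetched pages can actually be parsed:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Name the crawler identifies itself with; leaving this empty is a
       frequent cause of robots/agent errors during fetching.
       "myNutchCrawler" is only a placeholder. -->
  <property>
    <name>http.agent.name</name>
    <value>myNutchCrawler</value>
  </property>
  <!-- Agent strings checked against robots.txt; the http.agent.name value
       is normally expected to be listed first. -->
  <property>
    <name>http.robots.agents</name>
    <value>myNutchCrawler,*</value>
  </property>
  <!-- Assumed default plugin set; parse-(text|html|js) must be included,
       otherwise fetched pages fail with parser-not-found errors. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>

The java.net.SocketTimeoutException entries, on the other hand, are ordinary per-URL network timeouts and are usually harmless on their own.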
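On the missing linkdb, indexes and index directories: the crawl command normally creates them after the generate/fetch/update rounds, so their absence suggests the job stopped before the link-inversion and indexing steps ran (the quoted log does end right after the last "Fetcher: done"). Assuming the standard Nutch 0.9 command-line tools, those steps can in principle be run by hand against the existing crawl directory, roughly as follows; the exact arguments should be checked against the usage printed by bin/nutch:

# invert links from the fetched segments into a new linkdb
bin/nutch invertlinks crawled/linkdb -dir crawled/segments

# build per-segment indexes from the crawldb, linkdb and segments
bin/nutch index crawled/indexes crawled/crawldb crawled/linkdb crawled/segments/*

# remove duplicate documents and merge everything into a single index
bin/nutch dedup crawled/indexes
bin/nutch merge crawled/index crawled/indexes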
