Set any name. Read the Nutch manual for more information.

Alex
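For example, a minimal set of entries inside the <configuration> element of nutch-site.xml could look like the sketch below. The values are placeholders only, chosen to illustrate the shape of the settings, not anything prescribed by Nutch:

<!-- sketch only: all values below are placeholders, pick your own -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
  <description>HTTP 'User-Agent' name sent to web servers.</description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyTestCrawler,*</value>
  <description>Agent strings checked against robots.txt; should start with
  the same name as http.agent.name and usually ends with '*'.</description>
</property>
<property>
  <name>http.agent.description</name>
  <value>test crawl of wikipedia.org</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://example.com/crawler.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>crawler at example dot com</value>
</property>

A mismatch between http.agent.name and the first entry of http.robots.agents is one common cause of the http.robots.agents fatal error mentioned further down in the thread.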
2008/8/5 brainstorm <[EMAIL PROTECTED]>:

> fatal error regarding http.robots.agents

You should check or configure the following properties in nutch-site.xml properly:

<name>http.max.delays</name>
<name>http.robots.agents</name>
<name>http.agent.name</name>
<name>http.agent.description</name>
<name>http.agent.url</name>
<name>http.agent.email</name>

On Tue, Aug 5, 2008 at 8:56 AM, Alexander Aristov <[EMAIL PROTECTED]> wrote:

Do you have a proxy in your network?

2008/8/5 Mohammad Monirul Hoque <[EMAIL PROTECTED]>:

Hi,

The only thing I modified in crawl-urlfilter.txt is to add the line

+^http://([a-z0-9]*\.)*wikipedia.org/

I also commented out the previous line, like this:

#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

I also tried many other URLs, but each time it returned the same kind of result.

Another important thing: I am trying Nutch on Ubuntu now, which is showing the problem, but when I used it on Fedora Core 8 it worked fine.

I was previously running in pseudo-distributed mode, but after hitting the problem I tried stand-alone mode yesterday and it returned the same kind of result.

When I look at hadoop.log it shows that lots of pages were being fetched with lots of errors: a fatal error regarding http.robots.agents, "parser not found", java.net.SocketTimeoutException, and so on.

Please tell me where I am going wrong.

regards,
--monirul

----- Original Message ----
From: Tristan Buckner <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, August 5, 2008 12:46:21 AM
Subject: Re: problem in crawling

Are your URLs of the form
http://en.wikipedia.org/w/wiki.phtml?title=_&curid=foo
? If so, the robots file excludes these.

Also, is there a line above that line for which the URLs fail?

On Aug 4, 2008, at 11:37 AM, Mohammad Monirul Hoque wrote:

Hi,

Thanks for your reply. In my crawl-urlfilter.txt I included the following line, as I want to crawl Wikipedia:

+^http://([a-z0-9]*\.)*wikipedia.org/

My urls/urllist.txt contains Wikipedia URLs like the one below:

http://en.wikipedia.org/

I used Nutch 0.9 previously on Fedora 8. It worked fine.

So please tell me if you have any idea.

best regards,
--monirul

----- Original Message ----
From: Alexander Aristov <[EMAIL PROTECTED]>
To: [email protected]
Sent: Monday, August 4, 2008 1:28:58 PM
Subject: Re: problem in crawling

Hi,

What is in your crawl-urlfilter.txt file?

Did you include your URLs in the filter? By default all URLs are excluded.

Alexander
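For reference, a crawl-urlfilter.txt that admits only Wikipedia hosts would look roughly like the sketch below. Only the two wikipedia/MY.DOMAIN.NAME lines come from this thread; the remaining rules are the usual Nutch 0.9 defaults reproduced from memory and may differ slightly in your copy:

# conf/crawl-urlfilter.txt (sketch; default rules reproduced from memory)

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in wikipedia.org; everything else falls through to the final rule
+^http://([a-z0-9]*\.)*wikipedia.org/
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.

Note that the default -[?*!@=] rule already rejects query-string URLs such as the wiki.phtml?title=... links mentioned above, independently of robots.txt.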
2008/8/3 Mohammad Monirul Hoque <[EMAIL PROTECTED]>:

Hi,

I am using Nutch 0.9 on Ubuntu, on a single machine in pseudo-distributed mode.

When I execute the following command

bin/nutch crawl urls -dir crawled -depth 10

this is what I get in the hadoop log:

2008-08-03 03:10:17,392 INFO crawl.Crawl - crawl started in: crawled
2008-08-03 03:10:17,392 INFO crawl.Crawl - rootUrlDir = urls
2008-08-03 03:10:17,392 INFO crawl.Crawl - threads = 10
2008-08-03 03:10:17,392 INFO crawl.Crawl - depth = 10
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: starting
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: crawlDb: crawled/crawldb
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: urlDir: urls
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2008-08-03 03:10:35,227 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2008-08-03 03:10:59,724 INFO crawl.Injector - Injector: done
2008-08-03 03:11:00,791 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: starting
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031100
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: filtering: false
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:11:24,239 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:11:47,583 INFO crawl.Generator - Generator: done.
2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031100
2008-08-03 03:12:36,915 INFO fetcher.Fetcher - Fetcher: done
2008-08-03 03:12:36,951 INFO crawl.CrawlDb - CrawlDb update: starting
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031100]
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
2008-08-03 03:12:36,967 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2008-08-03 03:13:20,341 INFO crawl.CrawlDb - CrawlDb update: done
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: starting
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031321
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: filtering: false
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:13:39,667 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:14:04,963 INFO crawl.Generator - Generator: done.
2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031321
2008-08-03 03:21:26,809 INFO fetcher.Fetcher - Fetcher: done
2008-08-03 03:21:26,851 INFO crawl.CrawlDb - CrawlDb update: starting
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031321]
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
2008-08-03 03:21:26,866 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2008-08-03 03:22:13,223 INFO crawl.CrawlDb - CrawlDb update: done
2008-08-03 03:22:14,251 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: starting
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: segment: crawled/segments/20080803032214
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: filtering: false
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:22:34,459 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:22:59,733 INFO crawl.Generator - Generator: done.
2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803032214
2008-08-03 04:24:53,193 INFO fetcher.Fetcher - Fetcher: done
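One way to see how much of a run like this actually made it into the crawl database is the readdb tool; a minimal check, assuming the same crawled directory as in the log above:

# assumes the crawl output directory used above
bin/nutch readdb crawled/crawldb -stats

This should print the total number of URLs broken down by status (fetched, unfetched, gone), which makes it easier to tell fetch errors apart from URL-filtering problems.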
What I found when executing these commands:

$ bin/hadoop dfs -ls
Found 2 items
/user/nutch/crawled   <dir>
/user/nutch/urls      <dir>
$ bin/hadoop dfs -ls crawled
Found 2 items
/user/nutch/crawled/crawldb    <dir>
/user/nutch/crawled/segments   <dir>

Where are linkdb, indexes and index? So please tell me what the error might be.
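The linkdb, indexes, and index directories are only created by the link-inversion and indexing steps, which the one-shot crawl command runs after all the generate/fetch/update rounds have finished; if the run stops before that point they never appear. On 0.9 they can also be built by hand, roughly along these lines, where the paths assume the crawled directory used above (check bin/nutch with no arguments for the exact usage on your version):

# paths assume the 'crawled' directory from this thread
bin/nutch invertlinks crawled/linkdb -dir crawled/segments
bin/nutch index crawled/indexes crawled/crawldb crawled/linkdb crawled/segments/*

The remaining steps of the one-shot crawl, duplicate deletion and index merging, can be run afterwards with bin/nutch dedup and bin/nutch merge.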
Here is my hadoop-site.xml:

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>fs.default.name</name>
  <value>sysmonitor:9000</value>
  <description>
    The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.
  </description>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>sysmonitor:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and
    reduce task.
  </description>
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>
    The maximum number of tasks that will be run simultaneously by
    a task tracker. This should be adjusted according to the heap size
    per task, the amount of RAM available, and CPU consumption of
    each task.
  </description>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
  <description>
    You can specify other Java options for each map or reduce task
    here, but most likely you will want to adjust the heap size.
  </description>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/nutch/filesystem/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/nutch/filesystem/data</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/nutch/filesystem/mapreduce/system</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/nutch/filesystem/mapreduce/local</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>

My urls/urllist.txt contains almost 100 seed URLs and the depth is 10, but it seems very little crawling was done.

regards,
--monirul

--
Best Regards
Alexander Aristov
