Do you have a proxy in your network?
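If there is one, Nutch has to be told about it explicitly, which could also explain the java.net.SocketTimeoutException errors described below. A rough sketch of what would go into conf/nutch-site.xml — the host and port here are placeholders, and the property names are the stock http.proxy.* ones from nutch-default.xml:

<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
  <description>Hostname of the HTTP proxy; if empty, no proxy is used.</description>
</property>
<property>
  <name>http.proxy.port</name>
  <value>3128</value>
  <description>Port of the HTTP proxy.</description>
</property>

If there is no proxy, these properties should simply stay unset (the default empty http.proxy.host disables proxy use).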
2008/8/5 Mohammad Monirul Hoque <[EMAIL PROTECTED]>

> Hi,
>
> The only thing I modified in crawl-urlfilter.txt is to add the line
>
> +^http://([a-z0-9]*\.)*wikipedia.org/
>
> I also commented out the previous line, like this:
>
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
> I also tried many other URLs, but each time it returned the same kind of result.
>
> Another important thing: I am trying Nutch on Ubuntu now, which is showing the problem, but when I used it on Fedora Core 8 it worked fine.
>
> Previously I was trying pseudo-distributed mode, but after hitting the problem I tried stand-alone mode yesterday and it returned the same kind of result.
>
> When I look at hadoop.log it indicates that lots of pages were being fetched with lots of errors: a fatal error regarding http.robots.agents, "parser not found", java.net.SocketTimeoutException, etc.
>
> Please tell me where I am going wrong.
>
> regards,
> --monirul
>
> ----- Original Message ----
> From: Tristan Buckner <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Tuesday, August 5, 2008 12:46:21 AM
> Subject: Re: problem in crawling
>
> Are your URLs of the form
> http://en.wikipedia.org/w/wiki.phtml?title=_&curid=foo ?
> If so, the robots file excludes these.
>
> Also, is there a line above that one in the filter file that the URLs fail on?
>
> On Aug 4, 2008, at 11:37 AM, Mohammad Monirul Hoque wrote:
>
> > Hi,
> >
> > Thanks for your reply. In my crawl-urlfilter.txt I included the following line
> >
> > +^http://([a-z0-9]*\.)*wikipedia.org/
> >
> > as I want to crawl Wikipedia.
> >
> > My urls/urllist.txt contains Wikipedia URLs like the one below:
> >
> > http://en.wikipedia.org/
> >
> > I used Nutch 0.9 previously on Fedora 8. It worked fine.
> >
> > So please tell me if you have any idea.
> >
> > best regards,
> > --monirul
> >
> > ----- Original Message ----
> > From: Alexander Aristov <[EMAIL PROTECTED]>
> > To: [email protected]
> > Sent: Monday, August 4, 2008 1:28:58 PM
> > Subject: Re: problem in crawling
> >
> > Hi,
> >
> > What is in your crawl-urlfilter.txt file?
> >
> > Did you include your URLs in the filter? By default all URLs are excluded.
> >
> > Alexander
> >
> > 2008/8/3 Mohammad Monirul Hoque <[EMAIL PROTECTED]>
> >
> >> Hi,
> >>
> >> I'm using Nutch 0.9 on Ubuntu, on a single machine in pseudo-distributed mode.
> >> When I execute the following command
> >>
> >> bin/nutch crawl urls -dir crawled -depth 10
> >>
> >> this is what I get in the hadoop log:
> >>
> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - crawl started in: crawled
> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - rootUrlDir = urls
> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - threads = 10
> >> 2008-08-03 03:10:17,392 INFO crawl.Crawl - depth = 10
> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: starting
> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: crawlDb: crawled/crawldb
> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: urlDir: urls
> >> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> >> 2008-08-03 03:10:35,227 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> >> 2008-08-03 03:10:59,724 INFO crawl.Injector - Injector: done
> >> 2008-08-03 03:11:00,791 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: starting
> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031100
> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: filtering: false
> >> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: topN: 2147483647
> >> 2008-08-03 03:11:24,239 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
> >> 2008-08-03 03:11:47,583 INFO crawl.Generator - Generator: done.
> >> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: starting
> >> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031100
> >> 2008-08-03 03:12:36,915 INFO fetcher.Fetcher - Fetcher: done
> >> 2008-08-03 03:12:36,951 INFO crawl.CrawlDb - CrawlDb update: starting
> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031100]
> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> >> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> >> 2008-08-03 03:12:36,967 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> >> 2008-08-03 03:13:20,341 INFO crawl.CrawlDb - CrawlDb update: done
> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: starting
> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031321
> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: filtering: false
> >> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: topN: 2147483647
> >> 2008-08-03 03:13:39,667 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
> >> 2008-08-03 03:14:04,963 INFO crawl.Generator - Generator: done.
> >> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: starting
> >> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031321
> >> 2008-08-03 03:21:26,809 INFO fetcher.Fetcher - Fetcher: done
> >> 2008-08-03 03:21:26,851 INFO crawl.CrawlDb - CrawlDb update: starting
> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031321]
> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> >> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> >> 2008-08-03 03:21:26,866 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> >> 2008-08-03 03:22:13,223 INFO crawl.CrawlDb - CrawlDb update: done
> >> 2008-08-03 03:22:14,251 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: starting
> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: segment: crawled/segments/20080803032214
> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: filtering: false
> >> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: topN: 2147483647
> >> 2008-08-03 03:22:34,459 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
> >> 2008-08-03 03:22:59,733 INFO crawl.Generator - Generator: done.
> >> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: starting
> >> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803032214
> >> 2008-08-03 04:24:53,193 INFO fetcher.Fetcher - Fetcher: done
> >>
> >> Here is what I found executing these commands:
> >>
> >> bin/hadoop dfs -ls
> >> Found 2 items
> >> /user/nutch/crawled <dir>
> >> /user/nutch/urls <dir>
> >> $ bin/hadoop dfs -ls crawled
> >> Found 2 items
> >> /user/nutch/crawled/crawldb <dir>
> >> /user/nutch/crawled/segments <dir>
> >>
> >> Where are linkdb, indexes, and index? So please tell me where the error might be.
> >>
> >> Here is my hadoop-site.xml:
> >>
> >> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >>
> >> <!-- Put site-specific property overrides in this file. -->
> >>
> >> <configuration>
> >> <property>
> >> <name>fs.default.name</name>
> >> <value>sysmonitor:9000</value>
> >> <description>
> >> The name of the default file system. Either the literal string
> >> "local" or a host:port for NDFS.
> >> </description>
> >> </property>
> >> <property>
> >> <name>mapred.job.tracker</name>
> >> <value>sysmonitor:9001</value>
> >> <description>
> >> The host and port that the MapReduce job tracker runs at. If
> >> "local", then jobs are run in-process as a single map and
> >> reduce task.
> >> </description>
> >> </property>
> >> <property>
> >> <name>mapred.tasktracker.tasks.maximum</name>
> >> <value>2</value>
> >> <description>
> >> The maximum number of tasks that will be run simultaneously by
> >> a task tracker. This should be adjusted according to the heap size
> >> per task, the amount of RAM available, and CPU consumption of
> >> each task.
> >> </description>
> >> </property>
> >> <property>
> >> <name>mapred.child.java.opts</name>
> >> <value>-Xmx200m</value>
> >> <description>
> >> You can specify other Java options for each map or reduce task
> >> here, but most likely you will want to adjust the heap size.
> >> </description>
> >> </property>
> >> <property>
> >> <name>dfs.name.dir</name>
> >> <value>/nutch/filesystem/name</value>
> >> </property>
> >> <property>
> >> <name>dfs.data.dir</name>
> >> <value>/nutch/filesystem/data</value>
> >> </property>
> >> <property>
> >> <name>mapred.system.dir</name>
> >> <value>/nutch/filesystem/mapreduce/system</value>
> >> </property>
> >> <property>
> >> <name>mapred.local.dir</name>
> >> <value>/nutch/filesystem/mapreduce/local</value>
> >> </property>
> >> <property>
> >> <name>dfs.replication</name>
> >> <value>1</value>
> >> </property>
> >> </configuration>
> >>
> >> My urls/urllist.txt contains almost 100 seed URLs and the depth is 10, but it seems very little crawling was done.
> >>
> >> regards,
> >> --monirul
> >
> > --
> > Best Regards
> > Alexander Aristov

--
Best Regards
Alexander Aristov
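A note on two of the problems reported in this thread. The fatal error about http.robots.agents in Nutch 0.9 usually means the crawler's agent name was never configured. A minimal sketch of the relevant entries for conf/nutch-site.xml; the agent name "MyNutchCrawler" is only a placeholder, while the property names are the standard ones from nutch-default.xml:

<property>
  <name>http.agent.name</name>
  <value>MyNutchCrawler</value>
  <description>Name the crawler sends in its HTTP User-Agent header.</description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyNutchCrawler,*</value>
  <description>Comma-separated agent strings checked against robots.txt, in
  decreasing order of precedence; should start with the value of
  http.agent.name.</description>
</property>

As for the missing linkdb, indexes, and index directories: bin/nutch crawl normally writes them only after all fetch/update rounds have finished, so a crawl that is still running or died partway leaves just crawldb and segments behind. One way to check how far the fetching actually got is to dump the crawldb statistics, for example:

bin/nutch readdb crawled/crawldb -stats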
