Set any name. Read the Nutch manual for more information.

Alex
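For example, a minimal set of entries inside the <configuration> element of nutch-site.xml could look like the sketch below. The values are placeholders only, chosen to illustrate the shape of the settings, not anything prescribed by Nutch:

<!-- sketch only: all values below are placeholders, pick your own -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
  <description>HTTP 'User-Agent' name sent to web servers.</description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyTestCrawler,*</value>
  <description>Agent strings checked against robots.txt; should start with
  the same name as http.agent.name and usually ends with '*'.</description>
</property>
<property>
  <name>http.agent.description</name>
  <value>test crawl of wikipedia.org</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://example.com/crawler.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>crawler at example dot com</value>
</property>

A mismatch between http.agent.name and the first entry of http.robots.agents is one common cause of the http.robots.agents fatal error mentioned further down in the thread.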
2008/8/5 brainstorm <[EMAIL PROTECTED]>:

> fatal error regarding http.robots.agents

You should check or configure the following properties in nutch-site.xml properly:

<name>http.max.delays</name>
<name>http.robots.agents</name>
<name>http.agent.name</name>
<name>http.agent.description</name>
<name>http.agent.url</name>
<name>http.agent.email</name>

On Tue, Aug 5, 2008 at 8:56 AM, Alexander Aristov <[EMAIL PROTECTED]> wrote:

Do you have a proxy in your network?

2008/8/5 Mohammad Monirul Hoque <[EMAIL PROTECTED]>:

Hi,

The only thing I modified in crawl-urlfilter.txt is to add the line

+^http://([a-z0-9]*\.)*wikipedia.org/

I also commented out the previous line, like this:

#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

I also tried many other URLs, but each time it returned the same kind of result.

Another important thing: I am trying Nutch on Ubuntu now, which is showing the problem, but when I used it on Fedora Core 8 it worked fine.

I was previously running in pseudo-distributed mode, but after hitting the problem I tried stand-alone mode yesterday and it returned the same kind of result.

When I look at hadoop.log it shows that lots of pages were being fetched with lots of errors: a fatal error regarding http.robots.agents, "parser not found", java.net.SocketTimeoutException, and so on.

Please tell me where I am going wrong.

regards,
--monirul

----- Original Message ----
From: Tristan Buckner <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, August 5, 2008 12:46:21 AM
Subject: Re: problem in crawling

Are your URLs of the form
http://en.wikipedia.org/w/wiki.phtml?title=_&curid=foo
? If so, the robots file excludes these.

Also, is there a line above that line for which the URLs fail?

On Aug 4, 2008, at 11:37 AM, Mohammad Monirul Hoque wrote:

Hi,

Thanks for your reply. In my crawl-urlfilter.txt I included the following line, as I want to crawl Wikipedia:

+^http://([a-z0-9]*\.)*wikipedia.org/

My urls/urllist.txt contains Wikipedia URLs like the one below:

http://en.wikipedia.org/

I used Nutch 0.9 previously on Fedora 8. It worked fine.

So please tell me if you have any idea.

best regards,
--monirul

----- Original Message ----
From: Alexander Aristov <[EMAIL PROTECTED]>
To: [email protected]
Sent: Monday, August 4, 2008 1:28:58 PM
Subject: Re: problem in crawling

Hi,

What is in your crawl-urlfilter.txt file?

Did you include your URLs in the filter? By default all URLs are excluded.

Alexander
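For reference, a crawl-urlfilter.txt that admits only Wikipedia hosts would look roughly like the sketch below. Only the two wikipedia/MY.DOMAIN.NAME lines come from this thread; the remaining rules are the usual Nutch 0.9 defaults reproduced from memory and may differ slightly in your copy:

# conf/crawl-urlfilter.txt (sketch; default rules reproduced from memory)

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in wikipedia.org; everything else falls through to the final rule
+^http://([a-z0-9]*\.)*wikipedia.org/
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.

Note that the default -[?*!@=] rule already rejects query-string URLs such as the wiki.phtml?title=... links mentioned above, independently of robots.txt.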
2008/8/3 Mohammad Monirul Hoque <[EMAIL PROTECTED]>:

Hi,

I am using Nutch 0.9 on Ubuntu, on a single machine in pseudo-distributed mode.

When I execute the following command

bin/nutch crawl urls -dir crawled -depth 10

this is what I get in the hadoop log:

2008-08-03 03:10:17,392 INFO crawl.Crawl - crawl started in: crawled
2008-08-03 03:10:17,392 INFO crawl.Crawl - rootUrlDir = urls
2008-08-03 03:10:17,392 INFO crawl.Crawl - threads = 10
2008-08-03 03:10:17,392 INFO crawl.Crawl - depth = 10
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: starting
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: crawlDb: crawled/crawldb
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: urlDir: urls
2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2008-08-03 03:10:35,227 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2008-08-03 03:10:59,724 INFO crawl.Injector - Injector: done
2008-08-03 03:11:00,791 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: starting
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031100
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: filtering: false
2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:11:24,239 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:11:47,583 INFO crawl.Generator - Generator: done.
2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031100
2008-08-03 03:12:36,915 INFO fetcher.Fetcher - Fetcher: done
2008-08-03 03:12:36,951 INFO crawl.CrawlDb - CrawlDb update: starting
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031100]
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
2008-08-03 03:12:36,967 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2008-08-03 03:13:20,341 INFO crawl.CrawlDb - CrawlDb update: done
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: starting
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031321
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: filtering: false
2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:13:39,667 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:14:04,963 INFO crawl.Generator - Generator: done.
2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031321
2008-08-03 03:21:26,809 INFO fetcher.Fetcher - Fetcher: done
2008-08-03 03:21:26,851 INFO crawl.CrawlDb - CrawlDb update: starting
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031321]
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
2008-08-03 03:21:26,866 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2008-08-03 03:22:13,223 INFO crawl.CrawlDb - CrawlDb update: done
2008-08-03 03:22:14,251 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: starting
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: segment: crawled/segments/20080803032214
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: filtering: false
2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:22:34,459 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:22:59,733 INFO crawl.Generator - Generator: done.
2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: starting
2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803032214
2008-08-03 04:24:53,193 INFO fetcher.Fetcher - Fetcher: done
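One way to see how much of a run like this actually made it into the crawl database is the readdb tool; a minimal check, assuming the same crawled directory as in the log above:

# assumes the crawl output directory used above
bin/nutch readdb crawled/crawldb -stats

This should print the total number of URLs broken down by status (fetched, unfetched, gone), which makes it easier to tell fetch errors apart from URL-filtering problems.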
What I found when executing these commands:

$ bin/hadoop dfs -ls
Found 2 items
/user/nutch/crawled   <dir>
/user/nutch/urls      <dir>
$ bin/hadoop dfs -ls crawled
Found 2 items
/user/nutch/crawled/crawldb    <dir>
/user/nutch/crawled/segments   <dir>

Where are linkdb, indexes and index? So please tell me what the error might be.
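The linkdb, indexes, and index directories are only created by the link-inversion and indexing steps, which the one-shot crawl command runs after all the generate/fetch/update rounds have finished; if the run stops before that point they never appear. On 0.9 they can also be built by hand, roughly along these lines, where the paths assume the crawled directory used above (check bin/nutch with no arguments for the exact usage on your version):

# paths assume the 'crawled' directory from this thread
bin/nutch invertlinks crawled/linkdb -dir crawled/segments
bin/nutch index crawled/indexes crawled/crawldb crawled/linkdb crawled/segments/*

The remaining steps of the one-shot crawl, duplicate deletion and index merging, can be run afterwards with bin/nutch dedup and bin/nutch merge.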
Here is my hadoop-site.xml:

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>fs.default.name</name>
  <value>sysmonitor:9000</value>
  <description>
    The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.
  </description>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>sysmonitor:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and
    reduce task.
  </description>
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>
    The maximum number of tasks that will be run simultaneously by
    a task tracker. This should be adjusted according to the heap size
    per task, the amount of RAM available, and CPU consumption of
    each task.
  </description>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
  <description>
    You can specify other Java options for each map or reduce task
    here, but most likely you will want to adjust the heap size.
  </description>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/nutch/filesystem/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/nutch/filesystem/data</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/nutch/filesystem/mapreduce/system</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/nutch/filesystem/mapreduce/local</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>

My urls/urllist.txt contains almost 100 seed URLs and the depth is 10, but it seems very little crawling was done.

regards,
--monirul

--
Best Regards
Alexander Aristov
