Hi,

Thanks for your reply. In my crawl-urlfilter.txt I included the following line, since I want to crawl Wikipedia:

+^http://([a-z0-9]*\.)*wikipedia.org/

My urls/urllist.txt contains Wikipedia URLs such as:

http://en.wikipedia.org/
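Apart from that one line, my crawl-urlfilter.txt is essentially the stock file that ships with Nutch 0.9. Reproducing the default rules roughly from memory (so the exact patterns may differ slightly from my copy), the whole file looks something like this, with the Wikipedia rule placed before the final catch-all exclude:

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):

  # skip image and other suffixes we can't yet parse
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

  # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
  -.*(/[^/]+)/[^/]+\1/[^/]+\1/

  # accept hosts in wikipedia.org (the line quoted above)
  +^http://([a-z0-9]*\.)*wikipedia.org/

  # skip everything else
  -.

As far as I understand, the filter applies the first rule that matches a URL, so the +wikipedia.org line has to appear before the final "-." catch-all for the seed URLs to be accepted.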
I used Nutch 0.9 previously on Fedora 8 and it worked fine there. So please tell me if you have any idea what is going wrong.

Best regards,
--monirul

----- Original Message ----
From: Alexander Aristov <[EMAIL PROTECTED]>
To: [email protected]
Sent: Monday, August 4, 2008 1:28:58 PM
Subject: Re: problem in crawling

Hi,

What is in your crawl-urlfilter.txt file? Did you include your URLs in the filter? By default all URLs are excluded.

Alexander

2008/8/3 Mohammad Monirul Hoque <[EMAIL PROTECTED]>

> Hi,
>
> I am using Nutch 0.9 on Ubuntu on a single machine in pseudo-distributed
> mode. When I execute the following command:
>
> bin/nutch crawl urls -dir crawled -depth 10
>
> this is what I get in the Hadoop log:
>
> 2008-08-03 03:10:17,392 INFO crawl.Crawl - crawl started in: crawled
> 2008-08-03 03:10:17,392 INFO crawl.Crawl - rootUrlDir = urls
> 2008-08-03 03:10:17,392 INFO crawl.Crawl - threads = 10
> 2008-08-03 03:10:17,392 INFO crawl.Crawl - depth = 10
> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: starting
> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: crawlDb: crawled/crawldb
> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: urlDir: urls
> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2008-08-03 03:10:35,227 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> 2008-08-03 03:10:59,724 INFO crawl.Injector - Injector: done
> 2008-08-03 03:11:00,791 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: starting
> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031100
> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: filtering: false
> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: topN: 2147483647
> 2008-08-03 03:11:24,239 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
> 2008-08-03 03:11:47,583 INFO crawl.Generator - Generator: done.
> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: starting
> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031100
> 2008-08-03 03:12:36,915 INFO fetcher.Fetcher - Fetcher: done
> 2008-08-03 03:12:36,951 INFO crawl.CrawlDb - CrawlDb update: starting
> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031100]
> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> 2008-08-03 03:12:36,967 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> 2008-08-03 03:13:20,341 INFO crawl.CrawlDb - CrawlDb update: done
> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: starting
> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031321
> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: filtering: false
> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: topN: 2147483647
> 2008-08-03 03:13:39,667 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
> 2008-08-03 03:14:04,963 INFO crawl.Generator - Generator: done.
> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: starting
> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031321
> 2008-08-03 03:21:26,809 INFO fetcher.Fetcher - Fetcher: done
> 2008-08-03 03:21:26,851 INFO crawl.CrawlDb - CrawlDb update: starting
> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031321]
> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> 2008-08-03 03:21:26,866 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> 2008-08-03 03:22:13,223 INFO crawl.CrawlDb - CrawlDb update: done
> 2008-08-03 03:22:14,251 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: starting
> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: segment: crawled/segments/20080803032214
> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: filtering: false
> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: topN: 2147483647
> 2008-08-03 03:22:34,459 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
> 2008-08-03 03:22:59,733 INFO crawl.Generator - Generator: done.
> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: starting
> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803032214
> 2008-08-03 04:24:53,193 INFO fetcher.Fetcher - Fetcher: done
>
> This is what I found when executing the following commands:
>
> $ bin/hadoop dfs -ls
> Found 2 items
> /user/nutch/crawled <dir>
> /user/nutch/urls <dir>
>
> $ bin/hadoop dfs -ls crawled
> Found 2 items
> /user/nutch/crawled/crawldb <dir>
> /user/nutch/crawled/segments <dir>
>
> Where are the linkdb, indexes, and index directories? Please tell me what the error might be.
>
> Here is my hadoop-site.xml:
>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>sysmonitor:9000</value>
>     <description>
>       The name of the default file system. Either the literal string
>       "local" or a host:port for NDFS.
>     </description>
>   </property>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>sysmonitor:9001</value>
>     <description>
>       The host and port that the MapReduce job tracker runs at. If
>       "local", then jobs are run in-process as a single map and
>       reduce task.
>     </description>
>   </property>
>   <property>
>     <name>mapred.tasktracker.tasks.maximum</name>
>     <value>2</value>
>     <description>
>       The maximum number of tasks that will be run simultaneously by
>       a task tracker.
>       This should be adjusted according to the heap size per task,
>       the amount of RAM available, and CPU consumption of each task.
>     </description>
>   </property>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx200m</value>
>     <description>
>       You can specify other Java options for each map or reduce task here,
>       but most likely you will want to adjust the heap size.
>     </description>
>   </property>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>/nutch/filesystem/name</value>
>   </property>
>   <property>
>     <name>dfs.data.dir</name>
>     <value>/nutch/filesystem/data</value>
>   </property>
>   <property>
>     <name>mapred.system.dir</name>
>     <value>/nutch/filesystem/mapreduce/system</value>
>   </property>
>   <property>
>     <name>mapred.local.dir</name>
>     <value>/nutch/filesystem/mapreduce/local</value>
>   </property>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
> </configuration>
>
> My urls/urllist.txt contains almost 100 seed URLs and the depth is 10, but it
> seems very little crawling was done.
>
> regards
> --monirul

--
Best Regards
Alexander Aristov
