Hi,

The only thing I modified in crawl-urlfilter.txt was to add the line

+^http://([a-z0-9]*\.)*wikipedia.org/

and to comment out the previous line, like this:

#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

I also tried many other URLs, but each time it returned the same kind of result.

Another important thing: I am trying Nutch on Ubuntu now, which is showing the problem, but when I used it on Fedora Core 8 it worked fine. I was previously running in pseudo-distributed mode, but after running into the problem I tried stand-alone mode yesterday and it returned the same kind of result.

When I look at hadoop.log, it indicates that lots of pages were being fetched with lots of errors: a fatal error regarding http.robots.agents, "parser not found", java.net.SocketTimeoutException, and so on.

Please tell me where I am going wrong.

regards,
--monirul
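For reference, the relevant part of conf/crawl-urlfilter.txt would then look roughly like the sketch below. Only the wikipedia.org rule and the commented-out MY.DOMAIN.NAME rule come from this thread; the remaining rules are the assumed stock Nutch 0.9 defaults, included here because order matters: the regex URL filter takes the first matching rule, so an exclusion such as -[?*!@=] placed above the + rule will silently drop any URL containing '?', '=' or '&', such as the wiki.phtml?title= URLs mentioned in the quoted message below.

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in wikipedia.org (replaces the default MY.DOMAIN.NAME rule)
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://([a-z0-9]*\.)*wikipedia.org/

# skip everything else
-.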
----- Original Message ----
From: Tristan Buckner <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, August 5, 2008 12:46:21 AM
Subject: Re: problem in crawling

Are your URLs of the form http://en.wikipedia.org/w/wiki.phtml?title=_&curid=foo ? If so, the robots file excludes these. Also, is there a line above that one which the URLs might be failing on?

On Aug 4, 2008, at 11:37 AM, Mohammad Monirul Hoque wrote:

> Hi,
>
> Thanks for your reply. In my crawl-urlfilter.txt I included the following line
>
> +^http://([a-z0-9]*\.)*wikipedia.org/
>
> as I want to crawl Wikipedia.
>
> My urls/urllist.txt contains Wikipedia URLs like the one below:
>
> http://en.wikipedia.org/
>
> I used Nutch 0.9 previously on Fedora 8. It worked fine.
>
> So please tell me if you have any idea.
>
> best regards,
>
> --monirul
>
>
> ----- Original Message ----
> From: Alexander Aristov <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, August 4, 2008 1:28:58 PM
> Subject: Re: problem in crawling
>
> Hi,
>
> What is in your crawl-urlfilter.txt file?
>
> Did you include your URLs in the filter? By default all URLs are excluded.
>
> Alexander
>
> 2008/8/3 Mohammad Monirul Hoque <[EMAIL PROTECTED]>
>
>> Hi,
>>
>> I am using Nutch 0.9 on Ubuntu on a single machine in pseudo-distributed mode.
>> When I execute the following command
>>
>> bin/nutch crawl urls -dir crawled -depth 10
>>
>> this is what I got from the hadoop log:
>>
>> 2008-08-03 03:10:17,392 INFO crawl.Crawl - crawl started in: crawled
>> 2008-08-03 03:10:17,392 INFO crawl.Crawl - rootUrlDir = urls
>> 2008-08-03 03:10:17,392 INFO crawl.Crawl - threads = 10
>> 2008-08-03 03:10:17,392 INFO crawl.Crawl - depth = 10
>> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: starting
>> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: crawlDb: crawled/crawldb
>> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: urlDir: urls
>> 2008-08-03 03:10:17,461 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>> 2008-08-03 03:10:35,227 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>> 2008-08-03 03:10:59,724 INFO crawl.Injector - Injector: done
>> 2008-08-03 03:11:00,791 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: starting
>> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031100
>> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: filtering: false
>> 2008-08-03 03:11:00,792 INFO crawl.Generator - Generator: topN: 2147483647
>> 2008-08-03 03:11:24,239 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> 2008-08-03 03:11:47,583 INFO crawl.Generator - Generator: done.
>> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: starting
>> 2008-08-03 03:11:47,583 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031100
>> 2008-08-03 03:12:36,915 INFO fetcher.Fetcher - Fetcher: done
>> 2008-08-03 03:12:36,951 INFO crawl.CrawlDb - CrawlDb update: starting
>> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
>> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031100]
>> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
>> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
>> 2008-08-03 03:12:36,952 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
>> 2008-08-03 03:12:36,967 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>> 2008-08-03 03:13:20,341 INFO crawl.CrawlDb - CrawlDb update: done
>> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: starting
>> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: segment: crawled/segments/20080803031321
>> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: filtering: false
>> 2008-08-03 03:13:21,374 INFO crawl.Generator - Generator: topN: 2147483647
>> 2008-08-03 03:13:39,667 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> 2008-08-03 03:14:04,963 INFO crawl.Generator - Generator: done.
>> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: starting
>> 2008-08-03 03:14:04,963 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031321
>> 2008-08-03 03:21:26,809 INFO fetcher.Fetcher - Fetcher: done
>> 2008-08-03 03:21:26,851 INFO crawl.CrawlDb - CrawlDb update: starting
>> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
>> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031321]
>> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
>> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
>> 2008-08-03 03:21:26,852 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
>> 2008-08-03 03:21:26,866 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>> 2008-08-03 03:22:13,223 INFO crawl.CrawlDb - CrawlDb update: done
>> 2008-08-03 03:22:14,251 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: starting
>> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: segment: crawled/segments/20080803032214
>> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: filtering: false
>> 2008-08-03 03:22:14,252 INFO crawl.Generator - Generator: topN: 2147483647
>> 2008-08-03 03:22:34,459 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
>> 2008-08-03 03:22:59,733 INFO crawl.Generator - Generator: done.
>> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: starting
>> 2008-08-03 03:22:59,734 INFO fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803032214
>> 2008-08-03 04:24:53,193 INFO fetcher.Fetcher - Fetcher: done
>>
>> This is what I found when executing the following commands:
>>
>> $ bin/hadoop dfs -ls
>> Found 2 items
>> /user/nutch/crawled <dir>
>> /user/nutch/urls <dir>
>> $ bin/hadoop dfs -ls crawled
>> Found 2 items
>> /user/nutch/crawled/crawldb <dir>
>> /user/nutch/crawled/segments <dir>
>>
>> Where are linkdb, indexes and index? So please tell me where the error may be.
>>
>> Here is my hadoop-site.xml:
>>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>sysmonitor:9000</value>
>>     <description>
>>       The name of the default file system. Either the literal string
>>       "local" or a host:port for NDFS.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>sysmonitor:9001</value>
>>     <description>
>>       The host and port that the MapReduce job tracker runs at. If
>>       "local", then jobs are run in-process as a single map and
>>       reduce task.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.tasktracker.tasks.maximum</name>
>>     <value>2</value>
>>     <description>
>>       The maximum number of tasks that will be run simultaneously by
>>       a task tracker. This should be adjusted according to the heap size
>>       per task, the amount of RAM available, and CPU consumption of
>>       each task.
>>     </description>
>>   </property>
>>   <property>
>>     <name>mapred.child.java.opts</name>
>>     <value>-Xmx200m</value>
>>     <description>
>>       You can specify other Java options for each map or reduce task
>>       here, but most likely you will want to adjust the heap size.
>>     </description>
>>   </property>
>>   <property>
>>     <name>dfs.name.dir</name>
>>     <value>/nutch/filesystem/name</value>
>>   </property>
>>   <property>
>>     <name>dfs.data.dir</name>
>>     <value>/nutch/filesystem/data</value>
>>   </property>
>>   <property>
>>     <name>mapred.system.dir</name>
>>     <value>/nutch/filesystem/mapreduce/system</value>
>>   </property>
>>   <property>
>>     <name>mapred.local.dir</name>
>>     <value>/nutch/filesystem/mapreduce/local</value>
>>   </property>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>1</value>
>>   </property>
>> </configuration>
>>
>> My urls/urllist.txt contains almost 100 seed URLs and the depth is 10, but it seems very little crawling was done.
>>
>> regards,
>> --monirul
>
> --
> Best Regards
> Alexander Aristov
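A note on the fatal error regarding http.robots.agents and the "parser not found" messages reported at the top of the thread: Nutch 0.9 expects the crawler's agent name to be configured, and leaving http.agent.name empty is a commonly reported cause of robots/agent errors while fetching. A minimal conf/nutch-site.xml override along these lines is sketched below; the agent name "myNutchCrawler" is a placeholder, and the plugin.includes value is the assumed stock default, shown with parse-(text|html|js) enabled so fetched pages can actually be parsed:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Name the crawler identifies itself with; leaving this empty is a
       frequent cause of robots/agent errors during fetching.
       "myNutchCrawler" is only a placeholder. -->
  <property>
    <name>http.agent.name</name>
    <value>myNutchCrawler</value>
  </property>
  <!-- Agent strings checked against robots.txt; the http.agent.name value
       is normally expected to be listed first. -->
  <property>
    <name>http.robots.agents</name>
    <value>myNutchCrawler,*</value>
  </property>
  <!-- Assumed default plugin set; parse-(text|html|js) must be included,
       otherwise fetched pages fail with parser-not-found errors. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>

The java.net.SocketTimeoutException entries, on the other hand, are ordinary per-URL network timeouts and are usually harmless on their own.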
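On the missing linkdb, indexes and index directories: the crawl command normally creates them after the generate/fetch/update rounds, so their absence suggests the job stopped before the link-inversion and indexing steps ran (the quoted log does end right after the last "Fetcher: done"). Assuming the standard Nutch 0.9 command-line tools, those steps can in principle be run by hand against the existing crawl directory, roughly as follows; the exact arguments should be checked against the usage printed by bin/nutch:

# invert links from the fetched segments into a new linkdb
bin/nutch invertlinks crawled/linkdb -dir crawled/segments

# build per-segment indexes from the crawldb, linkdb and segments
bin/nutch index crawled/indexes crawled/crawldb crawled/linkdb crawled/segments/*

# remove duplicate documents and merge everything into a single index
bin/nutch dedup crawled/indexes
bin/nutch merge crawled/index crawled/indexes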
