Hi, I have 3 slaves listed in the conf/slaves file. I have also started all the processes using bin/start-all.sh and started crawling with the command bin/nutch crawl -dir crawld -depth 30 -topN 50, and the crawl finished successfully with no problems.

However, all the jobs are executed on the localhost machine. Is it possible to split the jobs across the 3 slave machines? If so, how can I do it? Please help me, it is urgent. At http://localhost:50030/ only one node is displayed:

    Maps: 0    Reduces: 0    Tasks/Node: 2    Nodes: 1
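For reference, my current understanding (please correct me if this is wrong) is that the slave tasktrackers only join the cluster when conf/slaves lists the slave hostnames and when fs.default.name and mapred.job.tracker in conf/hadoop-site.xml point at the master host rather than localhost, on the master and on every slave. A rough sketch of what I mean is below; master1, slave1, slave2, slave3 and the ports are only placeholder names, not my real hosts:

conf/slaves:

    slave1
    slave2
    slave3

conf/hadoop-site.xml (the same file copied to every machine):

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>fs.default.name</name>
        <!-- master1:9000 is a placeholder for the real namenode host:port -->
        <value>master1:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <!-- master1:9001 is a placeholder for the real jobtracker host:port -->
        <value>master1:9001</value>
      </property>
    </configuration>

Is that the right direction, or is something else needed before the map/reduce tasks are picked up by the tasktrackers on the slaves?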

Regards
Mohan Lal


"Håvard W. Kongsgård"-2 wrote:
>
> see:
> http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E
>
> Before you start Tomcat, remember to change the path of your search
> directory in the file nutch-site.xml in the webapps/ROOT/WEB-INF/classes
> directory.
>
> # This is an example of my configuration
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>LSearchDev01:9000</value>
>   </property>
>
>   <property>
>     <name>searcher.dir</name>
>     <value>/user/root/crawld</value>
>   </property>
> </configuration>
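(If I read the example above correctly, searcher.dir should point at the crawl directory on the DFS named by fs.default.name. Before starting Tomcat I assume it makes sense to check that the path really exists there, with something like the following, where /user/root/crawld is just the value from the example above:)

    bin/hadoop dfs -ls /user/root/crawld    # path taken from the example config

(I would expect it to list the crawldb, linkdb, segments and indexes directories once a crawl has finished.)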

> Mohan Lal wrote:
>> Hi,
>>
>> Thanks for your valuable information. I have solved that problem, but
>> after that I am facing another problem.
>> I have 2 slaves:
>> 1) MAC1
>> 2) MAC2
>>
>> But the job was running on MAC1 itself, and it takes a long time to
>> finish the crawling process. How can I assign the job to the distributed
>> machines I specified in the slaves file?
>>
>> My crawling process finished successfully. Also, how can I specify the
>> searcher dir in the nutch-site.xml file?
>>
>> <property>
>>   <name>searcher.dir</name>
>>   <value> ? </value>
>> </property>
>>
>> Please help me.
>>
>> I have done the following setup:
>>
>> [EMAIL PROTECTED] ~]# cd /home/lucene/nutch-0.8.1/
>> [EMAIL PROTECTED] nutch-0.8.1]# bin/hadoop namenode -format
>> Re-format filesystem in /tmp/hadoop/dfs/name ? (Y or N) Y
>> Formatted /tmp/hadoop/dfs/name
>> [EMAIL PROTECTED] nutch-0.8.1]# bin/start-all.sh
>> starting namenode, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-namenode-mohanlal.qburst.local.out
>> fpo: ssh: fpo: Name or service not known
>> localhost: starting datanode, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-mohanlal.qburst.local.out
>> starting jobtracker, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-jobtracker-mohanlal.qburst.local.out
>> fpo: ssh: fpo: Name or service not known
>> localhost: starting tasktracker, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-mohanlal.qburst.local.out
>> [EMAIL PROTECTED] nutch-0.8.1]# bin/stop-all.sh
>> stopping jobtracker
>> localhost: stopping tasktracker
>> sonu: no tasktracker to stop
>> stopping namenode
>> sonu: no datanode to stop
>> localhost: stopping datanode
>> [EMAIL PROTECTED] nutch-0.8.1]# bin/start-all.sh
>> starting namenode, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-namenode-mohanlal.qburst.local.out
>> sonu: starting datanode, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-sonu.qburst.local.out
>> localhost: starting datanode, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-mohanlal.qburst.local.out
>> starting jobtracker, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-jobtracker-mohanlal.qburst.local.out
>> localhost: starting tasktracker, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-mohanlal.qburst.local.out
>> sonu: starting tasktracker, logging to
>> /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-sonu.qburst.local.out
>> [EMAIL PROTECTED] nutch-0.8.1]# bin/hadoop dfs -put urls urls
>> [EMAIL PROTECTED] nutch-0.8.1]# bin/nutch crawl urls -dir crawl.1 -depth 2 -topN 10
>> crawl started in: crawl.1
>> rootUrlDir = urls
>> threads = 100
>> depth = 2
>> topN = 10
>> Injector: starting
>> Injector: crawlDb: crawl.1/crawldb
>> Injector: urlDir: urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: Merging injected urls into crawl db.
>> Injector: done
>> Generator: starting
>> Generator: segment: crawl.1/segments/20060929120038
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: crawl.1/segments/20060929120038
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawl.1/crawldb
>> CrawlDb update: segment: crawl.1/segments/20060929120038
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> Generator: starting
>> Generator: segment: crawl.1/segments/20060929120235
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: crawl.1/segments/20060929120235
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawl.1/crawldb
>> CrawlDb update: segment: crawl.1/segments/20060929120235
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> LinkDb: starting
>> LinkDb: linkdb: crawl.1/linkdb
>> LinkDb: adding segment: /user/root/crawl.1/segments/20060929120038
>> LinkDb: adding segment: /user/root/crawl.1/segments/20060929120235
>> LinkDb: done
>> Indexer: starting
>> Indexer: linkdb: crawl.1/linkdb
>> Indexer: adding segment: /user/root/crawl.1/segments/20060929120038
>> Indexer: adding segment: /user/root/crawl.1/segments/20060929120235
>> Indexer: done
>> Dedup: starting
>> Dedup: adding indexes in: crawl.1/indexes
>> Dedup: done
>> Adding /user/root/crawl.1/indexes/part-00000
>> Adding /user/root/crawl.1/indexes/part-00001
>> crawl finished: crawl.1
>>
>> Thanks and Regards
>> Mohanlal
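(A note on the log quoted above, in case it helps anyone else: the "fpo: ssh: fpo: Name or service not known" lines appeared while conf/slaves still contained a host name, fpo, that did not resolve; after it listed only resolvable hosts such as sonu, the datanode and tasktracker on that machine started. My assumption is that every name in conf/slaves has to resolve from the master and be reachable over passwordless ssh, for example with /etc/hosts entries roughly like the following, where the addresses are only placeholders:)

    # example entries only, replace the addresses with the real ones
    192.168.1.10    mohanlal.qburst.local    mohanlal
    192.168.1.11    sonu.qburst.local        sonu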
>>
>> "Håvard W. Kongsgård"-2 wrote:
>>
>>> Does /user/root/urls exist? Have you uploaded the urls folder to your
>>> DFS system?
>>>
>>> bin/hadoop dfs -mkdir urls
>>> bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt
>>>
>>> or
>>>
>>> bin/hadoop dfs -put <localsrc> <dst>
>>>
>>> Mohan Lal wrote:
>>>
>>>> Hi all,
>>>>
>>>> While I am trying to crawl using distributed machines, it throws an
>>>> error:
>>>>
>>>> bin/nutch crawl urls -dir crawl -depth 10 -topN 50
>>>> crawl started in: crawl
>>>> rootUrlDir = urls
>>>> threads = 10
>>>> depth = 10
>>>> topN = 50
>>>> Injector: starting
>>>> Injector: crawlDb: crawl/crawldb
>>>> Injector: urlDir: urls
>>>> Injector: Converting injected urls to crawl db entries.
>>>> Exception in thread "main" java.io.IOException: Input directory
>>>> /user/root/urls in localhost:9000 is invalid.
>>>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>>>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>>>>         at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>>>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>>>>
>>>> What's wrong with my configuration? Please help me.
>>>>
>>>> Regards
>>>> Mohan Lal
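(P.S. The "Input directory /user/root/urls in localhost:9000 is invalid" error quoted at the bottom went away once the urls directory had actually been put into the DFS as suggested above. I assume the upload can be double-checked with something like:)

    bin/hadoop dfs -ls urls    # relative paths resolve under /user/root

(which should show the seed list, e.g. urls/urls.txt, before bin/nutch crawl is run.)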
