Hi,

Thanks for the valuable information; that solved the problem. Now I am
facing another problem.
I have two slaves:
  1) MAC1
  2) MAC2

But the job runs only on MAC1, and it takes a long time to finish the
crawling process. How can I assign the job to the distributed machines I
specified in the slaves file?
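
My current understanding (please correct me if wrong) is that the slaves
file only controls where start-all.sh launches the datanode and
tasktracker daemons; for tasks to actually be distributed, fs.default.name
and mapred.job.tracker in conf/hadoop-site.xml must point at the master
host rather than localhost. A minimal sketch of what I think this should
look like, with MAC1 assumed as the master and the ports and task counts
only placeholders:

     <property>
          <name>fs.default.name</name>
          <value>MAC1:9000</value>
     </property>
     <property>
          <name>mapred.job.tracker</name>
          <value>MAC1:9001</value>
     </property>
     <property>
          <name>mapred.map.tasks</name>
          <value>4</value>
     </property>
     <property>
          <name>mapred.reduce.tasks</name>
          <value>2</value>
     </property>

conf/slaves would then list one worker hostname per line (MAC1 and MAC2
in my case).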

Still, my crawling process completed successfully. Also, how can I specify
the searcher.dir property in the nutch-site.xml file?

     <property>
          <name>searcher.dir</name>
          <value> ? </value>  
     </property>
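
From the log below, my crawl output ends up in /user/root/crawl.1, so my
guess (unconfirmed) is that searcher.dir should point at that crawl
directory, for example:

     <property>
          <name>searcher.dir</name>
          <value>/user/root/crawl.1</value>
     </property>

I have also read that for distributed search searcher.dir can instead
point at a directory containing a search-servers.txt file listing the
host and port of each search server, but I have not tried that.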

Please help me.


I have done the following setup:

[EMAIL PROTECTED] ~]# cd /home/lucene/nutch-0.8.1/
[EMAIL PROTECTED] nutch-0.8.1]# bin/hadoop namenode -format
Re-format filesystem in /tmp/hadoop/dfs/name ? (Y or N) Y
Formatted /tmp/hadoop/dfs/name
[EMAIL PROTECTED] nutch-0.8.1]# bin/start-all.sh
starting namenode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-namenode-mohanlal.qburst.local.out
fpo: ssh: fpo: Name or service not known
localhost: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-mohanlal.qburst.local.out
starting jobtracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-jobtracker-mohanlal.qburst.local.out
fpo: ssh: fpo: Name or service not known
localhost: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-mohanlal.qburst.local.out
[EMAIL PROTECTED] nutch-0.8.1]# bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
sonu: no tasktracker to stop
stopping namenode
sonu: no datanode to stop
localhost: stopping datanode
[EMAIL PROTECTED] nutch-0.8.1]# bin/start-all.sh
starting namenode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-namenode-mohanlal.qburst.local.out
sonu: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-sonu.qburst.local.out
localhost: starting datanode, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-datanode-mohanlal.qburst.local.out
starting jobtracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-jobtracker-mohanlal.qburst.local.out
localhost: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-mohanlal.qburst.local.out
sonu: starting tasktracker, logging to /home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-tasktracker-sonu.qburst.local.out
[EMAIL PROTECTED] nutch-0.8.1]# bin/hadoop dfs -put  urls urls
[EMAIL PROTECTED] nutch-0.8.1]# bin/nutch crawl urls -dir crawl.1 -depth 2 -topN 10
crawl started in: crawl.1
rootUrlDir = urls
threads = 100
depth = 2
topN = 10
Injector: starting
Injector: crawlDb: crawl.1/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: crawl.1/segments/20060929120038
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl.1/segments/20060929120038
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.1/crawldb
CrawlDb update: segment: crawl.1/segments/20060929120038
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: crawl.1/segments/20060929120235
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl.1/segments/20060929120235
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.1/crawldb
CrawlDb update: segment: crawl.1/segments/20060929120235
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl.1/linkdb
LinkDb: adding segment: /user/root/crawl.1/segments/20060929120038
LinkDb: adding segment: /user/root/crawl.1/segments/20060929120235
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl.1/linkdb
Indexer: adding segment: /user/root/crawl.1/segments/20060929120038
Indexer: adding segment: /user/root/crawl.1/segments/20060929120235
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl.1/indexes
Dedup: done
Adding /user/root/crawl.1/indexes/part-00000
Adding /user/root/crawl.1/indexes/part-00001
crawl finished: crawl.1
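
As a sanity check, listing the crawl directory on DFS should show the
crawldb, segments, linkdb, and indexes produced above (assuming the same
working directory):

bin/hadoop dfs -ls crawl.1
bin/hadoop dfs -ls crawl.1/indexes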


Thanks and Regards
Mohanlal


"Håvard W. Kongsgård" wrote:
> 
> Does /user/root/urls exist? Have you uploaded the urls folder to your
> DFS system?
> 
> bin/hadoop dfs -mkdir urls
> bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt
> 
> or
> 
> bin/hadoop dfs -put <localsrc> <dst>
> 
> 
> Mohan Lal wrote:
>> Hi all,
>>
>> While I am trying to crawl using distributed machines, it throws an error:
>>
>> bin/nutch crawl urls -dir crawl -depth 10 -topN 50
>> crawl started in: crawl
>> rootUrlDir = urls
>> threads = 10
>> depth = 10
>> topN = 50
>> Injector: starting
>> Injector: crawlDb: crawl/crawldb
>> Injector: urlDir: urls
>> Injector: Converting injected urls to crawl db entries.
>> Exception in thread "main" java.io.IOException: Input directory
>> /user/root/urls in localhost:9000 is invalid.
>>         at
>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>>         at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>>
>> What's wrong with my configuration? Please help me.
>>
>>
>> Regards
>> Mohan Lal 
>>   
> 
> 
> 


