[Nutch-general] Nutch 0.8.1 problems

Oleg V. Konovalov Wed, 21 Feb 2007 06:39:27 -0800

Hello, colleagues!

I have some problems starting and using Nutch 0.8.1 with Hadoop and Tomcat.


The system is an FC3 clone in its basic configuration. 
I downloaded the binary distributions of Apache Tomcat 5.5.20, Apache Ant 1.7.0, 
and JDK 1.5.0-05 from Sun.

Nutch built successfully (with some warnings), but I had to comment out one 
block in build.xml, otherwise Ant would not build Nutch (am I right?):

<touch datetime="01/25/1971 2:00 pm">
      <fileset dir="${conf.dir}" includes="**/*.template"/>
      <fileset dir="${contrib.dir}" includes="**/*.template"/>
</touch>

The next step is running Tomcat, which is the container for Nutch in my case. 
I had already built the "nutch-*.war" file, so I renamed it to ROOT.war, placed 
it in the "webapps" directory, and restarted Tomcat, as described in the (thin) 
Nutch tutorials. The web part works with some problems, but that is for later; 
for now I need to run Hadoop, and Nutch must be able to work with it.

Hadoop built (with the same problems) and was configured according to the 
tutorial, and here is the first problem: with the recommended "local" values in 
hadoop-site.xml, Hadoop did not start at all. So I replaced those literal 
strings with "localhost:900x" (according to the instructions) and tried to 
start the Hadoop instance. It starts and possibly works; I see no errors in the 
console or the logs, so I tried to run Nutch.

And here I have a set of problems of different types.

Nutch didn't work... :( 

First, I try to generate segments from the crawldb.

Switch to the "nutch" user first:

bash-3.00# su - nutch

Second, start Hadoop:

-bash-3.00$ cd ../search/
-bash-3.00$ bin/start-all.sh

starting namenode, logging to 
/nutch/search/../var/logs/hadoop-nutch-namenode-workstation15.dom2.out
localhost: starting datanode, logging to 
/nutch/search/../var/logs/hadoop-nutch-datanode-workstation15.dom2.out
starting jobtracker, logging to 
/nutch/search/../var/logs/hadoop-nutch-jobtracker-workstation15.dom2.out
localhost: starting tasktracker, logging to 
/nutch/search/../var/logs/hadoop-nutch-tasktracker-workstation15.dom2.out
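
The output above only shows that the four daemons were launched, not that they 
stayed up. A quick check might look like the sketch below. It assumes that 
`jps` from the Sun JDK is on the PATH and that the daemons show up under the 
short class names NameNode, DataNode, JobTracker, and TaskTracker; the helper 
name `missing_daemons` is made up for illustration.

```shell
#!/bin/sh
# Print the names of expected Hadoop daemons that do NOT appear in
# jps-style output ("<pid> <MainClass>" per line) read from stdin.
# The daemon names are an assumption, not taken from the logs above.
missing_daemons() {
  seen=$(awk '{print $2}')   # keep only the class-name column
  for d in NameNode DataNode JobTracker TaskTracker; do
    echo "$seen" | grep -qx "$d" || echo "$d"
  done
}

# On a live machine:
#   jps | missing_daemons
# Demo with canned input (the PIDs are made up); prints JobTracker and
# TaskTracker, one per line, because they are absent from the input:
printf '101 NameNode\n102 DataNode\n103 Jps\n' | missing_daemons
```

An empty result would mean all four daemons are still running; any name that is 
printed points at the log file to read first.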

OK, next, "generate":

-bash-3.00$ bin/nutch generate crawl/crawldb crawl/segments
Generator: starting
Generator: segment: crawl/segments/20070221171048
Generator: Selecting best-scoring urls due for fetch.
Exception in thread "main" java.io.IOException: Input directory 
/user/nutch/crawl/crawldb/current in localhost:9000 is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:319)
        at org.apache.nutch.crawl.Generator.main(Generator.java:395)

And so on...
From this point on, a step to the left or a step to the right gives the same 
result. I have tried many ways, but with no success...
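
One possible reading of the exception, not a confirmed diagnosis: the relative 
path crawl/crawldb is resolved inside DFS under /user/nutch, and nothing has 
created crawl/crawldb/current there yet. The sketch below lists that path and 
then injects seed URLs before running generate again. It assumes a local 
directory named `urls` holding a seed list (a hypothetical name) and uses 
`bin/hadoop dfs -ls`, `bin/hadoop dfs -put`, and `bin/nutch inject`. It only 
prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Build the absolute DFS path for a relative one, following the pattern
# seen in the error message (/user/<user>/<relative path>).
dfs_path() {
  # $1 = user name, $2 = relative path
  printf '/user/%s/%s\n' "$1" "$2"
}

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

echo "Generator will look for: $(dfs_path nutch crawl/crawldb/current)"
run bin/hadoop dfs -ls crawl/crawldb        # does the crawldb exist in DFS?
run bin/hadoop dfs -put urls urls           # copy the seed list (assumed dir) into DFS
run bin/nutch inject crawl/crawldb urls     # create the crawldb from the seeds
run bin/nutch generate crawl/crawldb crawl/segments
```

If the `-ls` step shows no crawldb, that would at least explain why generate 
rejects the input directory.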

The main problem I see is that there are no usable examples, documentation, or 
tutorials that answer my questions...

Any ideas? Somebody, help!

Thanks...

--
Oleg.

