Nutch Newbie wrote:
Hi:

Could someone please be kind enough to confirm whether the 0.9-dev trunk is
broken? I have done a total of 4 fresh installs, and every time I get
stuck in the indexing/reduce step. (Yes, speculative execution = false.)
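
For reference, this is roughly what I have in hadoop-site.xml to turn
speculative execution off (I am assuming the full property name is
mapred.speculative.execution; if your Hadoop version names it differently,
adjust accordingly):

  <!-- hadoop-site.xml (inside the <configuration> element) -->
  <property>
    <name>mapred.speculative.execution</name>
    <value>false</value>
    <description>Do not launch speculative (duplicate) map/reduce tasks.</description>
  </property>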

It would make me feel much better to know that I am not the only one with this problem!

Thank you for your help.


On 1/8/07, Nutch Newbie <[EMAIL PROTECTED]> wrote:
Hi:

I am getting the following error after updating to revision 494024. My
hadoop-site.xml has the speculative property (mapred.speculative) set to
false. I am not sure what I am doing wrong; everything worked before the
update. Any help would be appreciated.

Regards

Language identifier configuration [1-4/2048]
 map 100% reduce 0%
Language identifier plugin supports: it(1000) is(1000) hu(1000)
th(1000) sv(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000)
el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000)
nl(1000)
Adding org.apache.nutch.analysis.lang.LanguageIndexingFilter
running sort pass
flushing segment 0
reduce > sort
found resource common-terms.utf8 at
file:/usr/local/nutch-0.9-dev/conf/common-terms.utf8
Optimizing index.
Optimizing index.
job_qmhsvz
java.lang.RuntimeException: Unexpected status: 67
        at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:137)
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:297)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:134)

I can confirm that it is indeed a bug. I'll provide a patch soon - in the meantime you can just remove the throw statement - the other datums will simply be ignored.
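
To make the shape of that change concrete, here is a minimal sketch (the
class and method below are made up for illustration; the real
Indexer.reduce() also handles Inlinks, ParseData and ParseText, and the
exact trunk code may differ):

  import java.util.Iterator;
  import org.apache.nutch.crawl.CrawlDatum;

  // Stand-in for the datum handling in Indexer.reduce(): skip unexpected
  // statuses instead of throwing the RuntimeException seen in the log.
  public class IndexerWorkaroundSketch {

    static CrawlDatum pickFetchDatum(Iterator datums) {
      CrawlDatum fetchDatum = null;
      while (datums.hasNext()) {
        CrawlDatum datum = (CrawlDatum) datums.next();
        if (datum.getStatus() == CrawlDatum.STATUS_FETCH_SUCCESS) {
          fetchDatum = datum;   // a normally fetched page - keep it
        } else {
          // Was: throw new RuntimeException("Unexpected status: " + datum.getStatus());
          // Workaround: silently skip anything else, e.g. status 67
          // (CrawlDatum.STATUS_LINKED) from the log above.
          continue;
        }
      }
      return fetchDatum;
    }
  }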

The underlying issue is quite interesting - the status code that it's complaining about (67) is CrawlDatum.STATUS_LINKED, which indicates a page that was redirected. However, as you can see, there are probably some inlinks pointing to this page. Now the question is: should we discard this page (and index only the target)? The answer is not simple.
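
If dropping the check entirely feels too blunt, a narrower variant (again
only a sketch, not the upcoming patch; the helper below is hypothetical) is
to tolerate STATUS_LINKED specifically and keep the exception for anything
genuinely unexpected:

  import org.apache.nutch.crawl.CrawlDatum;

  // Hypothetical helper: accept fetched pages, quietly skip linked-only
  // datums (status 67 == CrawlDatum.STATUS_LINKED), fail on anything else.
  public class TolerateLinkedDatums {

    static void checkStatus(CrawlDatum datum) {
      byte status = datum.getStatus();
      if (status == CrawlDatum.STATUS_FETCH_SUCCESS) {
        return;   // indexable as usual
      }
      if (status == CrawlDatum.STATUS_LINKED) {
        return;   // the case discussed above - just skip it
      }
      throw new RuntimeException("Unexpected status: " + status);
    }
  }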

BTW, if you guys are brave enough to use the bleeding edge from SVN, then you are expected to discuss any issues that arise from its use on nutch-dev - this mailing list is for users of regular releases and stable versions.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

