Hello,
It's on two Linux boxes, one with CentOS and one with Ubuntu. Both run the
"old" bin/nutch crawl properly.
The problem is that it doesn't print the exception on the command line or in
Eclipse; it just writes it to the logs, so it's hard to debug.
One is running Nutch trunk from 07 March, and one from today's rc1.
Any hints? Maybe some log properties or something?
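For example, would bumping the relevant loggers in conf/log4j.properties to
DEBUG and echoing them to the console be the right approach? Something like
this (the logger names below are just my guess):

# just my guess at the relevant logger names
log4j.logger.org.apache.nutch.indexer.field=DEBUG,console
log4j.logger.org.apache.hadoop.mapred=DEBUG,console

# plain console appender so the messages also show up on stdout
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n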
In hadoop.log it looks exactly the same:
2009-03-09 12:12:09,452 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-03-09 12:12:09,452 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-03-09 12:12:09,560 INFO field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@6210fb
2009-03-09 12:12:09,560 INFO field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-agniesia441/mapred/local/index/_-174719952 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@48edb5 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@1ee2c2c ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
2009-03-09 12:12:09,585 WARN mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:1)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:1)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-09 12:12:10,021 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
    at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)
Thanks,
Bartosz
Dennis Kubes wrote:
Sorry about the docs being sparse on this. I will write more about
the process as time permits. I don't know about the problem below.
What platform are you running on, Windows or Linux?
Dennis
Bartosz Gadzimski wrote:
Hello,
Thanks, Dennis, for updating the wiki; it helped a lot.
You gave an example with indexing but didn't say much about it.
Can you write some more? :)
Anyway, I have problems at the last step (Nutch from 07 March):
bin/nutch org.apache.nutch.indexer.field.FieldIndexer
It simply stops somewhere:
2009-03-07 16:09:04,432 INFO field.FieldIndexer - FieldIndexer: starting
2009-03-07 16:09:04,436 INFO field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
2009-03-07 16:09:04,498 INFO field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
2009-03-07 16:09:05,636 INFO plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Registered Plugins:
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
.... plugins....
2009-03-07 16:09:07,769 INFO field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@1b4a74b
2009-03-07 16:09:07,769 INFO field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@15356d5 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@69d02b ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
2009-03-07 16:09:07,781 WARN mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-07 16:09:08,197 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
    at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)
In crawl/indexes there is only a _temporary folder.
I will try to debug this, but I have problems with running Nutch in Eclipse.
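One thing I might try instead is attaching the Eclipse debugger to the
running job rather than launching Nutch from Eclipse itself, along these
lines (assuming bin/nutch passes extra JVM options through NUTCH_OPTS, which
I haven't verified):

# hypothetical: make the JVM wait for a remote debugger on port 8000
export NUTCH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"
bin/nutch org.apache.nutch.indexer.field.FieldIndexer

and then connect a "Remote Java Application" debug configuration in Eclipse
to localhost:8000 with a breakpoint in FieldIndexer.reduce().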
Thanks,
Bartosz
Dennis Kubes wrote:
I don't know if I would make this primary yet. I need to check what
is causing this, as it worked fine for me; in fact, we currently have
it in production. Also, we would need to update the shell scripts to
integrate this more tightly.
Dennis
Bartosz Gadzimski wrote:
Sami Siren wrote:
Andrzej Bialecki wrote:
Sami Siren wrote:
I am planning to build the first rc for Nutch 1.0 on Tue
3.3.2009 morning (EET). There are still some issues marked as
fix for 1.0 in Jira. Neither of the two remaining _bugs_ seems
too important to me; actually, I only count the issues assigned
to developers as real candidates to be included in 1.0:
NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)
There's one Critical issue reported, related to NekoHTML
(NUTCH-700). I'm not sure what the feature differences
(pertinent to Nutch) are between 0.9.4 and 1.9.11 - perhaps
downgrading is the safest course of action.
I will take care of that.
I am also volunteering to push all open issues to 1.1 before
starting the RC build on Tuesday. Any objections to the proposed
procedure or timing?
Sounds good.
great!
--
Sami Siren
What about the new scoring and new indexing? Will it be integrated as the
primary scoring algorithm? I have a problem with it in LinkRank:
2009-03-02 20:43:45,708 INFO webgraph.LinkRank - Starting link counter job
2009-03-02 20:43:47,838 INFO webgraph.LinkRank - Finished link counter job
2009-03-02 20:43:47,839 INFO webgraph.LinkRank - Reading numlinks temp file
2009-03-02 20:43:47,840 INFO webgraph.LinkRank - Deleting numlinks temp file
2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
    at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
    at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
    at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)
Another question: what about the indexing framework mentioned here?
http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg11764.html
Having all this new scoring and indexing would be a real step forward.
Thanks,
Bartosz