Hello,
It's on two Linux boxes, one with CentOS and one with Ubuntu. Both run the
"old" bin/nutch crawl properly.
The problem is that it doesn't print the exception on the command line or in
Eclipse; it just writes it to the logs, so it's hard to debug.
One is running Nutch trunk from 07 March, and one from today's rc1.
Any hints? Maybe some log properties or something?
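For example, would bumping the relevant loggers in conf/log4j.properties to
DEBUG and echoing them to the console be the right approach? Something like
this (the logger names below are just my guess):

# just my guess at the relevant logger names
log4j.logger.org.apache.nutch.indexer.field=DEBUG,console
log4j.logger.org.apache.hadoop.mapred=DEBUG,console

# plain console appender so the messages also show up on stdout
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n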
In hadoop.log it looks exactly the same:
2009-03-09 12:12:09,452 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-03-09 12:12:09,452 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-03-09 12:12:09,560 INFO field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@6210fb
2009-03-09 12:12:09,560 INFO field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-agniesia441/mapred/local/index/_-174719952 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@48edb5 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@1ee2c2c ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
2009-03-09 12:12:09,585 WARN mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:1)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:1)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-09 12:12:10,021 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
    at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)
Thanks,
Bartosz
Dennis Kubes wrote:
Sorry about the docs being sparse on this. I will write more about
the process as time permits. I don't know about the problem below.
What platform are you running on, Windows or Linux?
Dennis
Bartosz Gadzimski wrote:
Hello,
Thanks, Dennis, for updating the wiki; it helped a lot.
You gave an example with indexing but didn't say much about it.
Can you write some more? :)
Anyway, I have problems at the last step (Nutch from 07 March):
bin/nutch org.apache.nutch.indexer.field.FieldIndexer
It simply stops somewhere:
2009-03-07 16:09:04,432 INFO field.FieldIndexer - FieldIndexer: starting
2009-03-07 16:09:04,436 INFO field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
2009-03-07 16:09:04,498 INFO field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
2009-03-07 16:09:05,636 INFO plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Registered Plugins:
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
.... plugins....
2009-03-07 16:09:07,769 INFO field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@1b4a74b
2009-03-07 16:09:07,769 INFO field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@15356d5 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@69d02b ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
2009-03-07 16:09:07,781 WARN mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-07 16:09:08,197 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
    at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)
In crawl/indexes there is only a _temporary folder.
I will try to debug this, but I have problems with running Nutch in Eclipse.
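One thing I might try instead is attaching the Eclipse debugger to the
running job rather than launching Nutch from Eclipse itself, along these
lines (assuming bin/nutch passes extra JVM options through NUTCH_OPTS, which
I haven't verified):

# hypothetical: make the JVM wait for a remote debugger on port 8000
export NUTCH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"
bin/nutch org.apache.nutch.indexer.field.FieldIndexer

and then connect a "Remote Java Application" debug configuration in Eclipse
to localhost:8000 with a breakpoint in FieldIndexer.reduce().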
Thanks,
Bartosz
Dennis Kubes wrote:
I don't know if I would make this primary yet. I need to check what
is causing this, as it worked fine for me; in fact, we currently have
it in production. Also, we would need to update the shell scripts to
integrate this more tightly.
Dennis
Bartosz Gadzimski wrote:
Sami Siren wrote:
Andrzej Bialecki wrote:
Sami Siren wrote:
I am planning to build the first rc for Nutch 1.0 on Tue
3.3.2009 morning (EET). There are still some issues marked as
fix for 1.0 in Jira. Neither of the two remaining _bugs_ seems
too important to me; actually, I only count the issues assigned
to developers as real candidates to be included in 1.0:
NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)
There's one Critical issue reported, related to NekoHTML
(NUTCH-700). I'm not sure what the feature differences
(pertinent to Nutch) are between 0.9.4 and 1.9.11 - perhaps
downgrading is the safest course of action.
I will take care of that.
I am also volunteering to push all open issues to 1.1 before
starting the RC build on Tuesday. Any objections to the proposed
procedure or timing?
Sounds good.
great!
--
Sami Siren
What about the new scoring and new indexing? Will it be integrated as the
primary scoring algorithm? I have a problem with it in LinkRank:
2009-03-02 20:43:45,708 INFO webgraph.LinkRank - Starting link counter job
2009-03-02 20:43:47,838 INFO webgraph.LinkRank - Finished link counter job
2009-03-02 20:43:47,839 INFO webgraph.LinkRank - Reading numlinks temp file
2009-03-02 20:43:47,840 INFO webgraph.LinkRank - Deleting numlinks temp file
2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
    at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
    at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
    at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)
Another question: what about the indexing framework mentioned here?
http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg11764.html
Having all this new scoring and indexing would be a real step forward.
Thanks,
Bartosz