Hello,

Thanks, Dennis, for updating the wiki; it helped a lot.

You gave an example with indexing, but you didn't say much about it. Can you write some more? :)

Anyway, I have a problem at the last step (Nutch from 07 March):

bin/nutch org.apache.nutch.indexer.field.FieldIndexer

It simply stops somewhere:

2009-03-07 16:09:04,432 INFO  field.FieldIndexer - FieldIndexer: starting
2009-03-07 16:09:04,436 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
2009-03-07 16:09:04,498 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
2009-03-07 16:09:05,636 INFO  plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Registered Plugins:
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
.... plugins....

2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@1b4a74b
2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@15356d5 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@69d02b ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
2009-03-07 16:09:07,781 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
       at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
       at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
       at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
       at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
       at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-07 16:09:08,197 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
       at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
       at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
       at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)




In crawl/indexes there is only a _temporary folder.

I will try to debug this, but I'm having problems running Nutch in Eclipse.
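One thing I may try instead of a full Eclipse launch configuration is attaching a remote debugger through the standard JDWP agent. This is only a sketch: it assumes the bin/nutch script passes NUTCH_OPTS through to the JVM, which is worth verifying in the script itself.

```shell
# Sketch of a remote-debug setup (NUTCH_OPTS pass-through is an assumption;
# check that bin/nutch actually appends it to the java command line).
# The JVM starts suspended and listens on port 8000; Eclipse can then attach
# via a "Remote Java Application" debug configuration on localhost:8000.
export NUTCH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"
# then run the failing job as before:
#   bin/nutch org.apache.nutch.indexer.field.FieldIndexer
```

With suspend=y the JVM waits for the debugger before running anything, so breakpoints set in FieldIndexer (e.g. around line 139) are hit even for early code.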

Thanks,
Bartosz



Dennis Kubes wrote:
I don't know if I would make this primary yet. I need to check what is causing this, as it worked fine for me; in fact, we currently have it in production. Also, we would need to update the shell scripts to integrate this more tightly.

Dennis

Bartosz Gadzimski wrote:
Sami Siren wrote:
Andrzej Bialecki wrote:
Sami Siren wrote:
I am planning to build the first RC for Nutch 1.0 on Tuesday 3.3.2009 morning (EET). There are still some issues marked as fix-for-1.0 in Jira. Neither of the two remaining _bugs_ seems too important to me; actually, I only count the issues assigned to developers as real candidates to be included in 1.0:

NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)

There's one Critical issue reported, related to NekoHTML (NUTCH-700). I'm not sure what the feature differences (pertinent to Nutch) between 0.9.4 and 1.9.11 are; perhaps downgrading is the safest course of action.
I will take care of that.


I am also volunteering to push all open issues to 1.1 before starting the RC build on Tuesday. Any objections to the proposed procedure or timing?

Sounds good.
great!

--
Sami Siren



What about the new scoring and new indexing? Will it be integrated as the primary scoring algorithm? I have a problem with LinkRank:

2009-03-02 20:43:45,708 INFO webgraph.LinkRank - Starting link counter job
2009-03-02 20:43:47,838 INFO webgraph.LinkRank - Finished link counter job
2009-03-02 20:43:47,839 INFO webgraph.LinkRank - Reading numlinks temp file
2009-03-02 20:43:47,840 INFO webgraph.LinkRank - Deleting numlinks temp file
2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
       at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
       at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
       at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)

Another question: what about the indexing framework mentioned here:
http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg11764.html


Having all this new scoring and indexing would be a real step forward.

Thanks,
Bartosz


