Sorry about the docs being sparse on this. I will write more about the process as time permits. I don't know about the problem below offhand. What platform are you running on, Windows or Linux?

Dennis

Bartosz Gadzimski wrote:
Hello,

Thanks, Dennis, for updating the wiki; it helped a lot.

You gave an example with indexing, but you didn't say much about it. Can you write some more? :)

Anyway, I have problems at the last step (Nutch from 07 March):

bin/nutch org.apache.nutch.indexer.field.FieldIndexer

It simply stops somewhere:

2009-03-07 16:09:04,432 INFO  field.FieldIndexer - FieldIndexer: starting
2009-03-07 16:09:04,436 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
2009-03-07 16:09:04,498 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
2009-03-07 16:09:05,636 INFO  plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Registered Plugins:
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Basic Query Filter (query-basic)
.... plugins....

2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@1b4a74b
2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@15356d5 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@69d02b ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
2009-03-07 16:09:07,781 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
       at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
       at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
       at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
       at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
       at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-07 16:09:08,197 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
       at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
       at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
       at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)




In crawl/indexes there is only a _temporary folder.

I will try to debug this, but I have problems running Nutch in Eclipse.
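For what it's worth, the NPE at FieldIndexer.java:139 sits inside OutputFormat's write(), which suggests a null field value (or name) slipping through while the index document is being built. I haven't verified this against the Nutch source, so treat it as a guess; until the root cause is found, a defensive workaround would be to skip null fields before writing. A minimal sketch of that pattern in plain Java (the `buildDocument` helper and the Map-based "document" are hypothetical stand-ins for the Lucene `Document`/`Field` API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for building an index document; the real code at
// FieldIndexer.java:139 uses Lucene's Document/Field classes instead.
public class SafeFieldWriter {

    // Copies only non-null name/value pairs into the "document",
    // collecting the names of the fields that were skipped.
    public static Map<String, String> buildDocument(Map<String, String> fields,
                                                    List<String> skipped) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (e.getKey() == null || e.getValue() == null) {
                skipped.add(String.valueOf(e.getKey()));
                continue; // a null here would otherwise surface as an NPE downstream
            }
            doc.put(e.getKey(), e.getValue());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", "http://example.com/");
        fields.put("anchor", null); // e.g. a page with no anchor text
        List<String> skipped = new ArrayList<>();
        Map<String, String> doc = buildDocument(fields, skipped);
        System.out.println(doc.size() + " field(s) written, skipped: " + skipped);
    }
}
```

If the skipped list is non-empty for some documents, that would at least narrow down which field db (basicfields or anchorfields) is producing the nulls.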

Thanks,
Bartosz



Dennis Kubes pisze:
I don't know if I would make this primary yet. I need to check what is causing this, as it worked fine for me; in fact, we currently have it in production. We would also need to update the shell scripts to integrate this more tightly.

Dennis

Bartosz Gadzimski wrote:
Sami Siren pisze:
Andrzej Bialecki wrote:
Sami Siren wrote:
I am planning to build the first RC for Nutch 1.0 on Tuesday morning, 3.3.2009 (EET). There are still some issues marked as fix-for-1.0 in Jira. Neither of the two remaining _bugs_ seems too important to me; actually, I only count the issues assigned to developers as real candidates to be included in 1.0:

NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)

There's one Critical issue reported, related to NekoHTML (NUTCH-700). I'm not sure what the feature differences (pertinent to Nutch) are between 0.9.4 and 1.9.11; perhaps downgrading is the safest course of action.
I will take care of that.


I am also volunteering to push all open issues to 1.1 before starting the RC build on Tuesday. Any objections to the proposed procedure or timing?

Sounds good.
great!

--
Sami Siren



What about the new scoring and new indexing? Will it be integrated as the primary scoring algorithm? I have a problem with LinkRank:

2009-03-02 20:43:45,708 INFO  webgraph.LinkRank - Starting link counter job
2009-03-02 20:43:47,838 INFO  webgraph.LinkRank - Finished link counter job
2009-03-02 20:43:47,839 INFO  webgraph.LinkRank - Reading numlinks temp file
2009-03-02 20:43:47,840 INFO  webgraph.LinkRank - Deleting numlinks temp file
2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
       at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
       at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
       at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
       at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)
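The failure comes right after "Reading numlinks temp file", so one guess (unverified against the LinkRank source) is that runCounter (LinkRank.java:113) gets back a null or empty result when it reads the link count from the temp file, e.g. because the counter job produced no output. A hedged sketch of a defensive read, assuming a hypothetical file layout of one number on the first line (the real code goes through Hadoop's FileSystem APIs; only the explicit null/empty check is the point):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical reader for a "numlinks" temp file holding a single count.
// Fails with a clear message instead of an NPE further down the line.
public class NumLinksReader {

    public static long readNumLinks(Reader source) throws IOException {
        try (BufferedReader in = new BufferedReader(source)) {
            String line = in.readLine();
            if (line == null || line.trim().isEmpty()) {
                throw new IOException("numlinks temp file is empty or missing");
            }
            return Long.parseLong(line.trim());
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a well-formed temp file containing the count 42.
        System.out.println(readNumLinks(new StringReader("42\n")));
    }
}
```

An empty WebGraph (no links counted yet) would be consistent with this; checking whether the webgraph/linkdb steps actually produced data before running LinkRank might be a quicker test than attaching a debugger.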

Another question: what about the indexing framework mentioned here:
http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg11764.html


Having all of this new scoring and indexing would be a real step forward.

Thanks,
Bartosz


