Hi Markus,

On 05/10/2012 09:42 AM, Markus Jelsma wrote:
Hi,

On Thu, 10 May 2012 09:10:04 +0300, Tolga wrote:
Hi,

This will sound like a duplicate, but actually it differs from the
other one. Please bear with me. Following
http://wiki.apache.org/nutch/NutchTutorial, I first issued the command

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

Then when I got the message

Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Please include the relevant part of the log. This may be a known issue.

This is an excerpt from hadoop.log:

2012-05-10 22:26:30,349 INFO crawl.Crawl - crawl started in: crawl-20120510222629
2012-05-10 22:26:30,350 INFO  crawl.Crawl - rootUrlDir = urls
2012-05-10 22:26:30,351 INFO  crawl.Crawl - threads = 10
2012-05-10 22:26:30,351 INFO  crawl.Crawl - depth = 3
2012-05-10 22:26:30,351 INFO crawl.Crawl - solrUrl=http://localhost:8983/solr/
2012-05-10 22:26:30,351 INFO  crawl.Crawl - topN = 100
2012-05-10 22:26:30,750 INFO  crawl.Injector - Injector: starting at 2012-05-10 22:26:30
2012-05-10 22:26:30,750 INFO  crawl.Injector - Injector: crawlDb: crawl-20120510222629/crawldb
2012-05-10 22:26:30,750 INFO  crawl.Injector - Injector: urlDir: urls
2012-05-10 22:26:30,809 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2012-05-10 22:26:34,173 INFO  plugin.PluginRepository - Plugins: looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Registered Plugins:
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Html Parse Plug-in (parse-html)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Basic Indexing Filter (index-basic)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - HTTP Framework (lib-http)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Registered Extension-Points:
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2012-05-10 22:26:35,439 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2012-05-10 22:26:36,434 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2012-05-10 22:26:36,710 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-05-10 22:26:37,542 INFO  crawl.Injector - Injector: finished at 2012-05-10 22:26:37, elapsed: 00:00:06
2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: starting at 2012-05-10 22:26:37
2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: filtering: true
2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: normalizing: true
2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: topN: 100
2012-05-10 22:26:37,552 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2012-05-10 22:26:37,820 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-05-10 22:26:37,820 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2012-05-10 22:26:37,820 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2012-05-10 22:26:37,856 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
...
...
INFO: [] webapp=/solr path=/update params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2} status=0 QTime=221
2012-05-10 22:36:26,336 INFO  solr.SolrIndexer - SolrIndexer: finished at 2012-05-10 22:36:26, elapsed: 00:00:05
2012-05-10 22:36:26,339 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2012-05-10 22:36:26
2012-05-10 22:36:26,339 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=74
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=0
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=220&version=2} hits=220 status=0 QTime=9
2012-05-10 22:36:27,656 WARN  mapred.LocalJobRunner - job_local_0020
java.lang.NullPointerException
    at org.apache.hadoop.io.Text.encode(Text.java:388)
    at org.apache.hadoop.io.Text.set(Text.java:178)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
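
Reading the trace above: the NullPointerException is raised in Text.set, called from the record reader inside SolrDeleteDuplicates, which is consistent with a null value being handed to it. One possible cause, and this is only an assumption, is that some of the 220 documents in the index lack one of the fields the dedup query requests (id, boost, tstamp, digest). The Java sketch below is not Nutch code: the class DigestGuard, the method withRequiredFields and the sample values are all hypothetical. It only illustrates the general idea of dropping incomplete rows before they reach code that cannot handle null.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DigestGuard {

    // Hypothetical stand-in for a Solr result row: field name -> value.
    // Returns only the rows that carry a non-null value for every required field.
    static List<Map<String, Object>> withRequiredFields(
            List<Map<String, Object>> docs, String... required) {
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            boolean complete = true;
            for (String field : required) {
                if (doc.get(field) == null) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A row with all four fields named in the dedup query (sample values).
        Map<String, Object> complete = new HashMap<>();
        complete.put("id", "http://example.com/");
        complete.put("boost", 1.0f);
        complete.put("tstamp", "2012-05-10T22:36:26Z");
        complete.put("digest", "abc123");

        // A row that only has an id, as assumed for the failing case.
        Map<String, Object> incomplete = new HashMap<>();
        incomplete.put("id", "http://example.com/old");

        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(complete);
        docs.add(incomplete);

        List<Map<String, Object>> kept =
                withRequiredFields(docs, "id", "boost", "tstamp", "digest");
        System.out.println(kept.size() + " of " + docs.size()
                + " documents carry all required fields");
    }
}
```

If the assumption holds, the rows that get filtered out here would be the ones worth looking at in the index.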


I issued the commands

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

and

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*

separately, after which I got no errors. When I browsed to
http://localhost:8983/solr/admin and attempted a search, I got the
error


   HTTP ERROR 400

Problem accessing /solr/select. Reason:

    undefined field text

But this is a Solr issue: your schema has no field named text. Resolve this in your Solr configuration or ask on the Solr mailing list.



------------------------------------------------------------------------
Powered by Jetty://

What am I doing wrong?

Regards,

Regards,
