Hi Markus,
On 05/10/2012 09:42 AM, Markus Jelsma wrote:
Hi,
On Thu, 10 May 2012 09:10:04 +0300, Tolga <[email protected]> wrote:
Hi,
This will sound like a duplicate, but actually it differs from the
other one. Please bear with me. Following
http://wiki.apache.org/nutch/NutchTutorial, I first issued the command
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Then when I got the message
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Please include the relevant part of the log. This may be a known issue.
This is an excerpt from hadoop.log:
2012-05-10 22:26:30,349 INFO crawl.Crawl - crawl started in: crawl-20120510222629
2012-05-10 22:26:30,350 INFO crawl.Crawl - rootUrlDir = urls
2012-05-10 22:26:30,351 INFO crawl.Crawl - threads = 10
2012-05-10 22:26:30,351 INFO crawl.Crawl - depth = 3
2012-05-10 22:26:30,351 INFO crawl.Crawl - solrUrl=http://localhost:8983/solr/
2012-05-10 22:26:30,351 INFO crawl.Crawl - topN = 100
2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: starting at 2012-05-10 22:26:30
2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: crawlDb: crawl-20120510222629/crawldb
2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: urlDir: urls
2012-05-10 22:26:30,809 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2012-05-10 22:26:34,173 INFO plugin.PluginRepository - Plugins: looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Plugins:
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Extension-Points:
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2012-05-10 22:26:35,439 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2012-05-10 22:26:36,434 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2012-05-10 22:26:36,710 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-05-10 22:26:37,542 INFO crawl.Injector - Injector: finished at 2012-05-10 22:26:37, elapsed: 00:00:06
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: starting at 2012-05-10 22:26:37
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: filtering: true
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: normalizing: true
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: topN: 100
2012-05-10 22:26:37,552 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2012-05-10 22:26:37,820 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2012-05-10 22:26:37,856 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
...
...
INFO: [] webapp=/solr path=/update params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2} status=0 QTime=221
2012-05-10 22:36:26,336 INFO solr.SolrIndexer - SolrIndexer: finished at 2012-05-10 22:36:26, elapsed: 00:00:05
2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2012-05-10 22:36:26
2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=74
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=0
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=220&version=2} hits=220 status=0 QTime=9
2012-05-10 22:36:27,656 WARN mapred.LocalJobRunner - job_local_0020
java.lang.NullPointerException
    at org.apache.hadoop.io.Text.encode(Text.java:388)
    at org.apache.hadoop.io.Text.set(Text.java:178)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
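
The trace shows the NullPointerException coming out of Text.set inside the SolrDeleteDuplicates record reader, right after the Solr query that requests fl=id,boost,tstamp,digest. One plausible reading, which is an assumption and not confirmed anywhere in this thread, is that some indexed documents lack one of those fields, so a null value reaches Text.set. A minimal Python sketch for spotting such documents in an already-fetched result list follows; the function name and the sample documents are hypothetical.

```python
# Fields requested by the dedup query in the log (fl=id,boost,tstamp,digest).
REQUIRED_FIELDS = ("id", "boost", "tstamp", "digest")


def find_incomplete_docs(docs, required=REQUIRED_FIELDS):
    """Return (doc id, missing field names) for every doc lacking a required field.

    `docs` is a list of dicts, for example the "docs" array of a Solr JSON
    response. A field counts as missing when it is absent or None.
    """
    incomplete = []
    for doc in docs:
        missing = [name for name in required if doc.get(name) is None]
        if missing:
            incomplete.append((doc.get("id"), missing))
    return incomplete


# Hypothetical sample data, not taken from the index in this thread.
sample_docs = [
    {"id": "http://example.com/a", "boost": 1.0,
     "tstamp": "2012-05-10T19:36:20Z", "digest": "abc123"},
    {"id": "http://example.com/b", "boost": 1.0,
     "tstamp": "2012-05-10T19:36:21Z"},  # no digest field
]

for doc_id, missing in find_incomplete_docs(sample_docs):
    print(doc_id, "is missing:", ", ".join(missing))
```

If every document turns out to carry all four fields, the assumption above does not hold and the cause lies elsewhere.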
I issued the commands
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
and
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*
separately, after which I got no errors. When I browsed to
http://localhost:8983/solr/admin and attempted a search, I got the
error
HTTP ERROR 400
Problem accessing /solr/select. Reason:
undefined field text
But this is a Solr issue: you have no field named text. Resolve this
in Solr or on the Solr mailing list.
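
The 400 response says the query referred to a field called text that the Solr schema does not define. A minimal Python sketch for listing the field names a schema.xml declares, so that a name such as text can be checked before searching, might look like this; the schema fragment and the function name are hypothetical and are not the configuration from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment, for illustration only.
SCHEMA_XML = """
<schema name="example" version="1.4">
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
  </fields>
  <defaultSearchField>content</defaultSearchField>
</schema>
"""


def declared_fields(schema_xml):
    """Return the set of names declared by <field> elements in the schema."""
    root = ET.fromstring(schema_xml)
    return {field.get("name") for field in root.iter("field")}


fields = declared_fields(SCHEMA_XML)
print("text defined:", "text" in fields)
print("content defined:", "content" in fields)
```

The sketch only looks at plain field declarations and ignores dynamic fields. If the name is missing, either the schema or the field that the search handler queries by default would have to change, which is a Solr configuration question.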
------------------------------------------------------------------------
Powered by Jetty://
What am I doing wrong?
Regards,
Regards,