Hi Markus,
On 05/10/2012 09:42 AM, Markus Jelsma wrote:
Hi,
On Thu, 10 May 2012 09:10:04 +0300, Tolga <to...@ozses.net> wrote:
Hi,
This will sound like a duplicate, but actually it differs from the other one. Please bear with me.
Following http://wiki.apache.org/nutch/NutchTutorial, I first issued the command
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Then when I got the message
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Please include the relevant part of the log. This may be a known issue.
This is an excerpt from hadoop.log:
2012-05-10 22:26:30,349 INFO crawl.Crawl - crawl started in: crawl-20120510222629
2012-05-10 22:26:30,350 INFO crawl.Crawl - rootUrlDir = urls
2012-05-10 22:26:30,351 INFO crawl.Crawl - threads = 10
2012-05-10 22:26:30,351 INFO crawl.Crawl - depth = 3
2012-05-10 22:26:30,351 INFO crawl.Crawl - solrUrl=http://localhost:8983/solr/
2012-05-10 22:26:30,351 INFO crawl.Crawl - topN = 100
2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: starting at 2012-05-10 22:26:30
2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: crawlDb: crawl-20120510222629/crawldb
2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: urlDir: urls
2012-05-10 22:26:30,809 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2012-05-10 22:26:34,173 INFO plugin.PluginRepository - Plugins: looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Plugins:
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Extension-Points:
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2012-05-10 22:26:35,439 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2012-05-10 22:26:36,434 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2012-05-10 22:26:36,710 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-05-10 22:26:37,542 INFO crawl.Injector - Injector: finished at 2012-05-10 22:26:37, elapsed: 00:00:06
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: starting at 2012-05-10 22:26:37
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: filtering: true
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: normalizing: true
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: topN: 100
2012-05-10 22:26:37,552 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2012-05-10 22:26:37,820 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2012-05-10 22:26:37,856 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
...
...
INFO: [] webapp=/solr path=/update params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2} status=0 QTime=221
2012-05-10 22:36:26,336 INFO solr.SolrIndexer - SolrIndexer: finished at 2012-05-10 22:36:26, elapsed: 00:00:05
2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2012-05-10 22:36:26
2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=74
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=0
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=220&version=2} hits=220 status=0 QTime=9
2012-05-10 22:36:27,656 WARN mapred.LocalJobRunner - job_local_0020
java.lang.NullPointerException
    at org.apache.hadoop.io.Text.encode(Text.java:388)
    at org.apache.hadoop.io.Text.set(Text.java:178)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
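The trace shows Text.set failing on a null value while SolrDeleteDuplicates reads back the documents it requested with fl=id,boost,tstamp,digest. One plausible cause, which the log does not confirm, is that some documents in the index have no value for one of those fields. A minimal Python sketch for checking that builds a /select URL listing documents that lack a given field. The helper name missing_field_query is made up for illustration, and the base URL is the one from the log.

```python
from urllib.parse import urlencode


def missing_field_query(solr_url: str, field: str, rows: int = 10) -> str:
    """Build a /select URL that lists ids of documents with no value for `field`.

    Hypothetical helper for illustration; it only builds the URL and does not
    contact Solr.
    """
    params = {
        # Match every document, then exclude those that have any value in `field`.
        "q": f"*:* -{field}:[* TO *]",
        "fl": "id",
        "rows": rows,
        "wt": "json",
    }
    return solr_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Fields taken from the fl= parameter in the dedup request above.
    for name in ("digest", "tstamp", "boost"):
        print(missing_field_query("http://localhost:8983/solr/", name))
```

Opening one of the printed URLs in a browser and seeing numFound greater than 0 would mean some documents lack that field, which would be consistent with the NullPointerException.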
I issued the commands
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
and
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*
separately, after which I got no errors. When I browsed to http://localhost:8983/solr/admin and attempted a search, I got the error
HTTP ERROR 400
Problem accessing /solr/select. Reason:
undefined field text
But this is a Solr issue: your schema has no field named text. Resolve this in Solr or ask on the Solr mailing list.
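While sorting out the schema, a search can also name its field explicitly, so it does not depend on whichever default field is in effect. A rough Python sketch of such a query URL follows. The field name content is an assumption about what the index contains and should be replaced with a field the schema actually defines. field_query is a hypothetical helper, not part of Nutch or Solr.

```python
from urllib.parse import urlencode


def field_query(solr_url: str, field: str, term: str, rows: int = 10) -> str:
    """Build a /select URL that searches `term` in an explicitly named field.

    Hypothetical helper for illustration; it only builds the URL.
    """
    params = {
        "q": f"{field}:{term}",  # explicit field, so no default field is needed
        "fl": "id",              # `id` is known to exist from the log above
        "rows": rows,
        "wt": "json",
    }
    return solr_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # "content" is an assumed field name; "nutch" is just a sample term.
    print(field_query("http://localhost:8983/solr/", "content", "nutch"))
```

If the named field is also missing from the schema, Solr should answer with the same kind of "undefined field" error, which at least confirms which fields the running schema knows about.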
------------------------------------------------------------------------
Powered by Jetty://
What am I doing wrong?
Regards,
Regards,