Ah, that means: don't use the crawl command; instead, do a little shell
scripting to execute the separate crawl cycle commands. See the Nutch wiki
for examples. And don't run solrdedup; search the Solr wiki for
deduplication.
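
Something like this rough sketch (directory names and the Solr URL are
assumptions, adjust to your setup; the wiki has the real thing):

#!/bin/sh
# One pass per depth level, replacing the all-in-one crawl command.
CRAWL=crawl
SEEDS=urls
SOLR=http://localhost:8983/solr/

bin/nutch inject $CRAWL/crawldb $SEEDS
for i in 1 2 3; do                              # roughly -depth 3
  bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN 100
  SEGMENT=`ls -d $CRAWL/segments/* | tail -1`   # newest segment
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb $CRAWL/crawldb $SEGMENT
done
bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments
bin/nutch solrindex $SOLR $CRAWL/crawldb -linkdb $CRAWL/linkdb \
  $CRAWL/segments/*
# Note: no solrdedup step; let Solr deduplicate at index time instead.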

cheers

On Fri, 11 May 2012 07:39:36 +0300, Tolga <to...@ozses.net> wrote:
Hi,

How exactly do I "omit solrdedup and use Solr's internal
deduplication" instead? I don't even know what any of that means :D
I've just used bin/nutch crawl urls -solr http://localhost:8983/solr/
-depth 3 -topN 100 to get the error. Do I have to use all the steps?

Regards,

On 05/10/2012 11:38 PM, Markus Jelsma wrote:
Thanks.

This is a known issue:
https://issues.apache.org/jira/browse/NUTCH-1100

I have not been able to find the bug, nor do I know how to reproduce it from scratch. If you have a public site with which we can reproduce it, please comment on the Jira ticket. Make sure you use either the default config or only small changes to it, a seed URL, and the exact crawl & dedup steps to reproduce.

If you find it, we might fix it. In any case, the dedup command is currently not scalable and needs to be replaced with a more scalable tool.

In the meantime you can omit solrdedup and use Solr's internal deduplication instead; it works similarly and uses the same signature algorithm as Nutch. Please consult the Solr wiki page on deduplication.
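
As a minimal sketch, the solrconfig.xml side looks roughly like this
(the chain name, signature field and source field are assumptions,
check the wiki and your schema):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- holds the computed signature, must be declared in schema.xml -->
    <str name="signatureField">signature</str>
    <!-- delete documents that collide on the signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- field(s) the signature is computed over -->
    <str name="fields">content</str>
    <!-- fuzzy signature, same family as Nutch's TextProfileSignature -->
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Then attach the chain to your update handler, e.g. via <str
name="update.chain">dedupe</str> in its defaults section.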

Good luck


On Thu, 10 May 2012 22:54:37 +0300, Tolga <to...@ozses.net> wrote:
Hi Markus,

On 05/10/2012 09:42 AM, Markus Jelsma wrote:
Hi,

On Thu, 10 May 2012 09:10:04 +0300, Tolga <to...@ozses.net> wrote:
Hi,

This will sound like a duplicate, but actually it differs from the
other one. Please bear with me. Following
http://wiki.apache.org/nutch/NutchTutorial, I first issued the command

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

Then I got the message

Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Please include the relevant part of the log. This may be a known issue.

This is an excerpt from hadoop.log:

2012-05-10 22:26:30,349 INFO  crawl.Crawl - crawl started in:
crawl-20120510222629
2012-05-10 22:26:30,350 INFO  crawl.Crawl - rootUrlDir = urls
2012-05-10 22:26:30,351 INFO  crawl.Crawl - threads = 10
2012-05-10 22:26:30,351 INFO  crawl.Crawl - depth = 3
2012-05-10 22:26:30,351 INFO  crawl.Crawl -
solrUrl=http://localhost:8983/solr/
2012-05-10 22:26:30,351 INFO  crawl.Crawl - topN = 100
2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: starting at
2012-05-10 22:26:30
2012-05-10 22:26:30,750 INFO  crawl.Injector - Injector: crawlDb:
crawl-20120510222629/crawldb
2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: urlDir: urls
2012-05-10 22:26:30,809 INFO  crawl.Injector - Injector: Converting
injected urls to crawl db entries.
2012-05-10 22:26:34,173 INFO  plugin.PluginRepository - Plugins:
looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Registered Plugins:
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     the nutch
core extension points (nutch-extensionpoints)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Html
Parse Plug-in (parse-html)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Basic
Indexing Filter (index-basic)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     HTTP
Framework (lib-http)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Http
Protocol Plug-in (protocol-http)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Tika
Parser Plug-in (parse-tika)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     OPIC
Scoring Plug-in (scoring-opic)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - CyberNeko
HTML Parser (lib-nekohtml)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Anchor
Indexing Filter (index-anchor)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Registered
Extension-Points:
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
Content Parser (org.apache.nutch.parse.Parser)
2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2012-05-10 22:26:35,439 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'inject', using default
2012-05-10 22:26:36,434 INFO  crawl.Injector - Injector: Merging
injected urls into crawl db.
2012-05-10 22:26:36,710 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes
where applicable
2012-05-10 22:26:37,542 INFO crawl.Injector - Injector: finished at
2012-05-10 22:26:37, elapsed: 00:00:06
2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: starting
at 2012-05-10 22:26:37
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: filtering: true
2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: normalizing: true
2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: topN: 100
2012-05-10 22:26:37,552 INFO  crawl.Generator - Generator: jobtracker
is 'local', generating exactly one partition.
2012-05-10 22:26:37,820 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-05-10 22:26:37,820 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-05-10 22:26:37,820 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-05-10 22:26:37,856 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
...
...
INFO: [] webapp=/solr path=/update params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2} status=0 QTime=221
2012-05-10 22:36:26,336 INFO  solr.SolrIndexer - SolrIndexer:
finished at 2012-05-10 22:36:26, elapsed: 00:00:05
2012-05-10 22:36:26,339 INFO  solr.SolrDeleteDuplicates -
SolrDeleteDuplicates: starting at 2012-05-10 22:36:26
2012-05-10 22:36:26,339 INFO  solr.SolrDeleteDuplicates -
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=74
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=0
May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=220&version=2} hits=220 status=0 QTime=9
2012-05-10 22:36:27,656 WARN  mapred.LocalJobRunner - job_local_0020
java.lang.NullPointerException
    at org.apache.hadoop.io.Text.encode(Text.java:388)
    at org.apache.hadoop.io.Text.set(Text.java:178)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)


I issued the commands

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

and

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb
crawldb/linkdb crawldb/segments/*

separately, after which I got no errors. When I browsed to
http://localhost:8983/solr/admin and attempted a search, I got the
error


   HTTP ERROR 400

Problem accessing /solr/select. Reason:

    undefined field text

But this is a Solr thing: you have no field named text. Resolve it in Solr, or ask on the Solr mailing list.
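
As a sketch, either query a field that does exist, e.g.
q=content:something, or declare a catch-all field in schema.xml
(field and type names here are assumptions, match your schema):

<field name="text" type="text" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="content" dest="text"/>

Or set <defaultSearchField>content</defaultSearchField> so bare
queries search the content field.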





------------------------------------------------------------------------

Powered by Jetty://

What am I doing wrong?

Regards,

Regards,


--
Markus Jelsma - CTO - Openindex
