Ah, that means: don't use the crawl command; instead do a little shell scripting to run the separate crawl cycle commands yourself. See the Nutch wiki for examples. And don't run solrdedup; search the Solr wiki for deduplication.
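As a rough sketch of what that scripting looks like with the Nutch 1.4 command set (directory layout, variable names, and the DRY_RUN switch are my own assumptions for illustration, not from this thread; the Nutch wiki has the authoritative recipe):

```shell
#!/bin/sh
# One crawl cycle using the individual Nutch commands instead of the
# all-in-one "crawl" command, with the solrdedup step omitted entirely.
DRY_RUN="${DRY_RUN:-1}"      # default: only print the commands; set DRY_RUN= to execute
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB=crawl/crawldb
LINKDB=crawl/linkdb
SEGMENTS=crawl/segments
SOLR_URL=http://localhost:8983/solr/
DEPTH=3
TOPN=100

run() {                      # echo in dry-run mode, execute otherwise
  if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi
}

run "$NUTCH" inject "$CRAWLDB" urls
i=1
while [ "$i" -le "$DEPTH" ]; do
  run "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN"
  SEGMENT=$(ls -d "$SEGMENTS"/* 2>/dev/null | tail -1)   # newest segment
  SEGMENT="${SEGMENT:-$SEGMENTS/SEGMENT}"                # placeholder in dry runs
  run "$NUTCH" fetch "$SEGMENT"
  run "$NUTCH" parse "$SEGMENT"
  run "$NUTCH" updatedb "$CRAWLDB" "$SEGMENT"
  i=$((i + 1))
done
run "$NUTCH" invertlinks "$LINKDB" -dir "$SEGMENTS"
run "$NUTCH" solrindex "$SOLR_URL" "$CRAWLDB" -linkdb "$LINKDB" "$SEGMENTS"/*
# no solrdedup step: deduplication is left to Solr's update chain instead
```

By default the script just prints the command sequence so you can inspect it before pointing it at a real crawl directory.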
cheers

On Fri, 11 May 2012 07:39:36 +0300, Tolga <[hidden email]> wrote:
> Hi,
>
> How exactly do I "omit solrdedup and use Solr's internal deduplication"
> instead? I don't even know what any of that means :D I've just used
> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 100
> to get the error. Do I have to run all the steps?
>
> Regards,
>
> On 05/10/2012 11:38 PM, Markus Jelsma wrote:
>> Thanks.
>>
>> This is a known issue:
>> https://issues.apache.org/jira/browse/NUTCH-1100
>>
>> I have not been able to find the bug, nor do I know how to reproduce it
>> from scratch. If you have a public site with which we can reproduce it,
>> please comment on the Jira ticket. Make sure you use the default config
>> (or close to it), a seed URL, and the exact crawl & dedup steps needed
>> to reproduce.
>>
>> If you find it, we might fix it. In any case we need to replace the
>> dedup command with a more scalable tool, which it currently is not.
>>
>> In the meantime you can omit solrdedup and use Solr's internal
>> deduplication instead; it works similarly and uses the same signature
>> algorithm as Nutch. Please consult the Solr wiki page on deduplication.
>>
>> Good luck
>>
>> On Thu, 10 May 2012 22:54:37 +0300, Tolga <[hidden email]> wrote:
>>> Hi Markus,
>>>
>>> On 05/10/2012 09:42 AM, Markus Jelsma wrote:
>>>> Hi,
>>>>
>>>> On Thu, 10 May 2012 09:10:04 +0300, Tolga <[hidden email]> wrote:
>>>>> Hi,
>>>>>
>>>>> This will sound like a duplicate, but it actually differs from the
>>>>> other one. Please bear with me. Following
>>>>> http://wiki.apache.org/nutch/NutchTutorial, I first issued the command
>>>>>
>>>>> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
>>>>>
>>>>> Then I got the message:
>>>>>
>>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>>>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>>
>>>> Please include the relevant part of the log. This can be a known issue.
>>>
>>> This is an excerpt from hadoop.log:
>>>
>>> 2012-05-10 22:26:30,349 INFO crawl.Crawl - crawl started in: crawl-20120510222629
>>> 2012-05-10 22:26:30,350 INFO crawl.Crawl - rootUrlDir = urls
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - threads = 10
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - depth = 3
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - solrUrl=http://localhost:8983/solr/
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - topN = 100
>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: starting at 2012-05-10 22:26:30
>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: crawlDb: crawl-20120510222629/crawldb
>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: urlDir: urls
>>> 2012-05-10 22:26:30,809 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>>> 2012-05-10 22:26:34,173 INFO plugin.PluginRepository - Plugins: looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Plugins:
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Html Parse Plug-in (parse-html)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Basic Indexing Filter (index-basic)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         HTTP Framework (lib-http)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Extension-Points:
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>>> 2012-05-10 22:26:35,439 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>>> 2012-05-10 22:26:36,434 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>>> 2012-05-10 22:26:36,710 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2012-05-10 22:26:37,542 INFO crawl.Injector - Injector: finished at 2012-05-10 22:26:37, elapsed: 00:00:06
>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: starting at 2012-05-10 22:26:37
>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: filtering: true
>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: normalizing: true
>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: topN: 100
>>> 2012-05-10 22:26:37,552 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
>>> 2012-05-10 22:26:37,820 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>> 2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>> 2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>> 2012-05-10 22:26:37,856 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
>>> ...
>>> ...
>>> INFO: [] webapp=/solr path=/update params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2} status=0 QTime=221
>>> 2012-05-10 22:36:26,336 INFO solr.SolrIndexer - SolrIndexer: finished at 2012-05-10 22:36:26, elapsed: 00:00:05
>>> 2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2012-05-10 22:36:26
>>> 2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=74
>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=0
>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=220&version=2} hits=220 status=0 QTime=9
>>> 2012-05-10 22:36:27,656 WARN mapred.LocalJobRunner - job_local_0020
>>> java.lang.NullPointerException
>>>     at org.apache.hadoop.io.Text.encode(Text.java:388)
>>>     at org.apache.hadoop.io.Text.set(Text.java:178)
>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>>
>>>>> I issued the commands
>>>>>
>>>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>>>>>
>>>>> and
>>>>>
>>>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*
>>>>>
>>>>> separately, after which I got no errors. When I browsed to
>>>>> http://localhost:8983/solr/admin and attempted a search, I got the error
>>>>>
>>>>> HTTP ERROR 400
>>>>>
>>>>> Problem accessing /solr/select. Reason:
>>>>>
>>>>>     undefined field text
>>>>
>>>> But this is a Solr thing: you have no field named text. Resolve this
>>>> in Solr or on the Solr mailing list.
>>>>
>>>>> Powered by Jetty://
>>>>>
>>>>> What am I doing wrong?
>>>>>
>>>>> Regards,
>>>
>>> Regards,

Markus Jelsma - CTO - Openindex
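For reference, the Solr-side deduplication mentioned above is configured in solrconfig.xml as an update processor chain. A minimal sketch, with field names that are assumptions to be matched against your own schema (the chain name and processor classes are from the Solr deduplication documentation; TextProfileSignature is the same fuzzy-signature algorithm Nutch uses):

```xml
<!-- solrconfig.xml: compute a content signature at index time so Solr
     deduplicates documents itself, replacing Nutch's solrdedup step.
     "signature" and "content" are assumed field names; "signature"
     must also be declared as a stored string field in schema.xml. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be attached to your update request handler so it actually runs on indexing; the Solr wiki page on deduplication shows the handler wiring for each Solr version.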
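On the "undefined field text" error: the admin search form queries Solr's default search field, and the schema in use here evidently defines no field named text. A quick workaround is to query an explicit field, e.g. /solr/select?q=content:someterm. Alternatively, point the default at a field the schema does define; a sketch, assuming your Nutch schema.xml has a content field (verify before copying):

```xml
<!-- schema.xml: make queries without an explicit field search "content" -->
<defaultSearchField>content</defaultSearchField>
```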

