Yes. Also take a look at this page [1] for script examples.

[1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
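In case that wiki page moves, here is roughly what such a script boils down to. This is only a sketch: the urls/ seed dir, crawl/ dir, depth, topN and Solr URL are placeholders taken from this thread, and it assumes a Nutch 1.x layout where fetching and parsing run as separate steps. Note that it never calls solrdedup, per Markus' advice further down:

  #!/bin/sh
  # Sketch of an incremental whole-web crawl cycle using the individual
  # commands instead of the all-in-one "bin/nutch crawl".
  SEEDDIR=urls                        # directory with seed URL list(s)
  CRAWLDIR=crawl                      # holds crawldb, linkdb, segments
  SOLR=http://localhost:8983/solr/    # Solr instance to index into
  DEPTH=3                             # generate/fetch/update rounds
  TOPN=100                            # max URLs to fetch per round

  bin/nutch inject $CRAWLDIR/crawldb $SEEDDIR

  i=0
  while [ $i -lt $DEPTH ]; do
    bin/nutch generate $CRAWLDIR/crawldb $CRAWLDIR/segments -topN $TOPN
    # the newest segment is the one generate just created
    SEGMENT=$CRAWLDIR/segments/`ls $CRAWLDIR/segments | sort | tail -1`
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb $CRAWLDIR/crawldb $SEGMENT   # feed outlinks back in
    i=`expr $i + 1`
  done

  bin/nutch invertlinks $CRAWLDIR/linkdb -dir $CRAWLDIR/segments
  # index into Solr -- and deliberately no solrdedup step afterwards
  bin/nutch solrindex $SOLR $CRAWLDIR/crawldb -linkdb $CRAWLDIR/linkdb $CRAWLDIR/segments/*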
On Thu, May 17, 2012 at 6:07 AM, Tolga <[email protected]> wrote:

> I'm still confused. You mean to use
> http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling ?
>
> On 5/15/12 2:05 PM, Markus Jelsma wrote:
>
>> Please follow the step-by-step tutorial; it's explained there:
>> http://wiki.apache.org/nutch/NutchTutorial
>>
>> On Tuesday 15 May 2012 13:40:26 Tolga wrote:
>>
>>> I'm a little confused. How can I not use the crawl command and
>>> execute the separate crawl cycle commands at the same time?
>>>
>>> Regards,
>>>
>>> On 5/11/12 9:40 AM, Markus Jelsma wrote:
>>>
>>>> Ah, that means: don't use the crawl command; instead, do a little
>>>> shell scripting to execute the separate crawl cycle commands. See
>>>> the Nutch wiki for examples. And don't do solrdedup; search the
>>>> Solr wiki for deduplication.
>>>>
>>>> cheers
>>>>
>>>> On Fri, 11 May 2012 07:39:36 +0300, Tolga <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How exactly do I "omit solrdedup and use Solr's internal
>>>>> deduplication" instead? I don't even know what any of that means :D
>>>>> I've just used bin/nutch crawl urls -solr http://localhost:8983/solr/
>>>>> -depth 3 -topN 100 to get the error. Do I have to use all the steps?
>>>>>
>>>>> Regards,
>>>>>
>>>>> On 05/10/2012 11:38 PM, Markus Jelsma wrote:
>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> This is a known issue:
>>>>>> https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>>
>>>>>> I have not been able to find the bug, nor do I know how to
>>>>>> reproduce it from scratch. If you have a public site with which
>>>>>> we can reproduce it, please comment on the Jira ticket. Make sure
>>>>>> you use either the default config or close to it, a seed URL, and
>>>>>> the exact crawl & dedup steps to reproduce.
>>>>>>
>>>>>> If you find it, we might fix it. In any case we need to replace
>>>>>> the dedup command with a more scalable tool, which it currently
>>>>>> is not.
>>>>>>
>>>>>> In the meantime you can omit solrdedup and use Solr's internal
>>>>>> deduplication instead; it works similarly and uses the same
>>>>>> signature algorithm as Nutch. Please consult the Solr wiki page
>>>>>> on deduplication.
>>>>>>
>>>>>> Good luck
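A note for the archives: the Solr-side deduplication Markus describes is configured in solrconfig.xml. Below is a minimal sketch based on the Solr wiki's Deduplication page; the signature field name "digest", the "fields" list, and the TextProfileSignature class (Solr's port of the Nutch signature) are assumptions to adapt to your own schema.

  <!-- solrconfig.xml: compute a content signature on every update and
       overwrite documents that end up with the same one -->
  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">digest</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">content</str>
      <str name="signatureClass">solr.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

  <!-- hook the chain into the update handler -->
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>

The signature field itself must be defined in schema.xml as an indexed, stored string. With this in place, documents sent by Nutch's solrindex collapse on identical signatures at update time, which replaces the separate solrdedup job.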
>>>>>> On Thu, 10 May 2012 22:54:37 +0300, Tolga <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Markus,
>>>>>>>
>>>>>>> On 05/10/2012 09:42 AM, Markus Jelsma wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Thu, 10 May 2012 09:10:04 +0300, Tolga <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> This will sound like a duplicate, but it actually differs from
>>>>>>>>> the other one. Please bear with me. Following
>>>>>>>>> http://wiki.apache.org/nutch/NutchTutorial, I first issued the
>>>>>>>>> command
>>>>>>>>>
>>>>>>>>> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
>>>>>>>>>
>>>>>>>>> Then, when I got the message
>>>>>>>>>
>>>>>>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>>>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>>>>>>>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>>>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>>>>>>
>>>>>>>> Please include the relevant part of the log. This can be a
>>>>>>>> known issue.
>>>>>>>
>>>>>>> This is an excerpt from hadoop.log:
>>>>>>>
>>>>>>> 2012-05-10 22:26:30,349 INFO crawl.Crawl - crawl started in: crawl-20120510222629
>>>>>>> 2012-05-10 22:26:30,350 INFO crawl.Crawl - rootUrlDir = urls
>>>>>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - threads = 10
>>>>>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - depth = 3
>>>>>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - solrUrl=http://localhost:8983/solr/
>>>>>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - topN = 100
>>>>>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: starting at 2012-05-10 22:26:30
>>>>>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: crawlDb: crawl-20120510222629/crawldb
>>>>>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: urlDir: urls
>>>>>>> 2012-05-10 22:26:30,809 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>>>>>>> 2012-05-10 22:26:34,173 INFO plugin.PluginRepository - Plugins: looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Plugins:
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   the nutch core extension points (nutch-extensionpoints)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   Basic URL Normalizer (urlnormalizer-basic)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   Html Parse Plug-in (parse-html)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   Basic Indexing Filter (index-basic)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   HTTP Framework (lib-http)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   Pass-through URL Normalizer (urlnormalizer-pass)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   Regex URL Filter (urlfilter-regex)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   Http Protocol Plug-in (protocol-http)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   Regex URL Normalizer (urlnormalizer-regex)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   Tika Parser Plug-in (parse-tika)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   OPIC Scoring Plug-in (scoring-opic)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   CyberNeko HTML Parser (lib-nekohtml)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   Anchor Indexing Filter (index-anchor)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -   Regex URL Filter Framework (lib-regex-filter)
>>>>>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Extension-Points:
>>>>>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -   Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -   Nutch Protocol (org.apache.nutch.protocol.Protocol)
>>>>>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -   Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
>>>>>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -   Nutch URL Filter (org.apache.nutch.net.URLFilter)
>>>>>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -   Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -   HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -   Nutch Content Parser (org.apache.nutch.parse.Parser)
>>>>>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -   Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>>>>>>> 2012-05-10 22:26:35,439 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>>>>>>> 2012-05-10 22:26:36,434 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>>>>>>> 2012-05-10 22:26:36,710 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>>> 2012-05-10 22:26:37,542 INFO crawl.Injector - Injector: finished at 2012-05-10 22:26:37, elapsed: 00:00:06
>>>>>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: starting at 2012-05-10 22:26:37
>>>>>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>>>>>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: filtering: true
>>>>>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: normalizing: true
>>>>>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: topN: 100
>>>>>>> 2012-05-10 22:26:37,552 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
>>>>>>> 2012-05-10 22:26:37,820 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>>>>>> 2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>>>>>> 2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>>>>>> 2012-05-10 22:26:37,856 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
>>>>>>> ...
>>>>>>> ...
>>>>>>> INFO: [] webapp=/solr path=/update params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2} status=0 QTime=221
>>>>>>> 2012-05-10 22:36:26,336 INFO solr.SolrIndexer - SolrIndexer: finished at 2012-05-10 22:36:26, elapsed: 00:00:05
>>>>>>> 2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2012-05-10 22:36:26
>>>>>>> 2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
>>>>>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>>>>>> INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=74
>>>>>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>>>>>> INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=0
>>>>>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>>>>>> INFO: [] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=220&version=2} hits=220 status=0 QTime=9
>>>>>>> 2012-05-10 22:36:27,656 WARN mapred.LocalJobRunner - job_local_0020
>>>>>>> java.lang.NullPointerException
>>>>>>>     at org.apache.hadoop.io.Text.encode(Text.java:388)
>>>>>>>     at org.apache.hadoop.io.Text.set(Text.java:178)
>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>>>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>>>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>>>>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>>>>>
>>>>>>>>> I issued the commands
>>>>>>>>>
>>>>>>>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*
>>>>>>>>>
>>>>>>>>> separately, after which I got no errors. When I browsed to
>>>>>>>>> http://localhost:8983/solr/admin and attempted a search, I got
>>>>>>>>> the error
>>>>>>>>>
>>>>>>>>> HTTP ERROR 400
>>>>>>>>> Problem accessing /solr/select. Reason:
>>>>>>>>>     undefined field text
>>>>>>>>> Powered by Jetty://
>>>>>>>>>
>>>>>>>>> What am I doing wrong?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> But this is a Solr thing, you have no field named text. Resolve
>>>>>>>> this in Solr or on the Solr mailing list.
>>>>>>>>
>>>>>>>> Regards,
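And for anyone else who hits "undefined field text": it means the query (or the select handler's default search field) references a field called "text" that the schema in use does not define. Either query an existing field explicitly (e.g. q=content:someterm) or add a catch-all field. A sketch of the latter for schema.xml; the field type name "text" and the copied source fields are assumptions to match against your own schema:

  <!-- schema.xml: a catch-all field so bare queries like q=someterm work -->
  <field name="text" type="text" stored="false" indexed="true" multiValued="true"/>

  <copyField source="title" dest="text"/>
  <copyField source="content" dest="text"/>

  <!-- make it the default field for queries that don't name one -->
  <defaultSearchField>text</defaultSearchField>

Reindex after changing the schema.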
--
Jean-François Gingras

