Yes. Also take a look at this page [1] for script examples.

[1]
http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
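In case the wiki is unreachable, here is a rough sketch of what such a script does, using the individual crawl-cycle commands instead of bin/nutch crawl (Nutch 1.4-era syntax). All paths and the DEPTH/TOPN values are assumptions; adjust them to your own setup.

```shell
# Sketch of one whole-web crawl cycle with the individual commands
# (inject -> generate/fetch/parse/updatedb loop -> invertlinks -> solrindex).
# Paths and numbers below are placeholders, not a definitive script.
NUTCH=bin/nutch
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
LINKDB=crawl/linkdb
SOLR=http://localhost:8983/solr/
DEPTH=3
TOPN=100

crawl_cycle() {
  "$NUTCH" inject "$CRAWLDB" urls
  for i in $(seq 1 "$DEPTH"); do
    "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN"
    # operate on the segment that generate just created (newest directory)
    segment=$(ls -d "$SEGMENTS"/* | tail -1)
    "$NUTCH" fetch "$segment"
    "$NUTCH" parse "$segment"
    "$NUTCH" updatedb "$CRAWLDB" "$segment"
  done
  "$NUTCH" invertlinks "$LINKDB" -dir "$SEGMENTS"
  # index to Solr but skip solrdedup (NUTCH-1100); dedup on the Solr side instead
  "$NUTCH" solrindex "$SOLR" "$CRAWLDB" -linkdb "$LINKDB" "$SEGMENTS"/*
}
# run it with: crawl_cycle
```

This is the same cycle the crawl command runs internally, just without the final solrdedup step.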

On Thu, May 17, 2012 at 6:07 AM, Tolga <[email protected]> wrote:

> I'm still confused. You mean to use
> http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling ?
>
>
> On 5/15/12 2:05 PM, Markus Jelsma wrote:
>
>> Please follow the step-by-step tutorial, it's explained there:
>> http://wiki.apache.org/nutch/NutchTutorial
>>
>> On Tuesday 15 May 2012 13:40:26 Tolga wrote:
>>
>>> I'm a little confused. How can I not use the crawl command and execute
>>> the separate crawl cycle commands at the same time?
>>>
>>> Regards,
>>>
>>> On 5/11/12 9:40 AM, Markus Jelsma wrote:
>>>
>>>> Ah, that means don't use the crawl command; instead do a little shell
>>>> scripting to execute the separate crawl cycle commands (see the Nutch
>>>> wiki for examples). And don't run solrdedup; search the Solr wiki for
>>>> deduplication.
>>>>
>>>> cheers
>>>>
>>>> On Fri, 11 May 2012 07:39:36 +0300, Tolga<[email protected]>  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How exactly do I "omit solrdedup and use Solr's internal
>>>>> deduplication"? I don't even know what any of that means :D
>>>>> I've just used bin/nutch crawl urls -solr http://localhost:8983/solr/
>>>>> -depth 3 -topN 100 and got the error. Do I have to use all the steps?
>>>>>
>>>>> Regards,
>>>>>
>>>>> On 05/10/2012 11:38 PM, Markus Jelsma wrote:
>>>>>
>>>>>> thanks
>>>>>>
>>>>>> This is a known issue:
>>>>>> https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>>
>>>>>> I have not been able to find the bug, nor do I know how to reproduce it
>>>>>> from scratch. If you have a public site with which we can reproduce
>>>>>> it, please comment on the Jira ticket. Make sure you use the default
>>>>>> config or close to it, a seed URL, and the exact crawl & dedup
>>>>>> steps to reproduce.
>>>>>>
>>>>>> If you find it we might fix it. In any case we need to replace the
>>>>>> dedup command with a more scalable tool; it currently does not scale well.
>>>>>>
>>>>>> In the meantime you can omit solrdedup and use Solr's internal
>>>>>> deduplication instead; it works similarly and uses the same signature
>>>>>> algorithm as Nutch. Please consult the Solr wiki page on
>>>>>> deduplication.
>>>>>>
>>>>>> Good luck
>>>>>>
>>>>>> On Thu, 10 May 2012 22:54:37 +0300, Tolga<[email protected]>  wrote:
>>>>>>
>>>>>>> Hi Markus,
>>>>>>>
>>>>>>> On 05/10/2012 09:42 AM, Markus Jelsma wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Thu, 10 May 2012 09:10:04 +0300, Tolga<[email protected]>  wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> This will sound like a duplicate, but actually it differs from the
>>>>>>>>> other one. Please bear with me. Following
>>>>>>>>> http://wiki.apache.org/nutch/NutchTutorial,
>>>>>>>>> I first issued the
>>>>>>>>> command
>>>>>>>>>
>>>>>>>>> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3
>>>>>>>>> -topN 5
>>>>>>>>>
>>>>>>>>> Then when I got the message
>>>>>>>>>
>>>>>>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>>>>>>
>>>>>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>>>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>>>>>>>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>>>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>>>>>>>
>>>>>>>> Please include the relevant part of the log. This can be a known
>>>>>>>> issue.
>>>>>>>>
>>>>>>> This is an excerpt from hadoop.log:
>>>>>>>
>>>>>>> 2012-05-10 22:26:30,349 INFO  crawl.Crawl - crawl started in:
>>>>>>> crawl-20120510222629
>>>>>>> 2012-05-10 22:26:30,350 INFO  crawl.Crawl - rootUrlDir = urls
>>>>>>> 2012-05-10 22:26:30,351 INFO  crawl.Crawl - threads = 10
>>>>>>> 2012-05-10 22:26:30,351 INFO  crawl.Crawl - depth = 3
>>>>>>> 2012-05-10 22:26:30,351 INFO  crawl.Crawl -
>>>>>>> solrUrl=http://localhost:8983/solr/
>>>>>>> 2012-05-10 22:26:30,351 INFO  crawl.Crawl - topN = 100
>>>>>>> 2012-05-10 22:26:30,750 INFO  crawl.Injector - Injector: starting at
>>>>>>> 2012-05-10 22:26:30
>>>>>>> 2012-05-10 22:26:30,750 INFO  crawl.Injector - Injector: crawlDb:
>>>>>>> crawl-20120510222629/crawldb
>>>>>>> 2012-05-10 22:26:30,750 INFO  crawl.Injector - Injector: urlDir: urls
>>>>>>> 2012-05-10 22:26:30,809 INFO  crawl.Injector - Injector: Converting
>>>>>>> injected urls to crawl db entries.
>>>>>>> 2012-05-10 22:26:34,173 INFO  plugin.PluginRepository - Plugins:
>>>>>>> looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Plugin
>>>>>>> Auto-activation mode: [true]
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Registered
>>>>>>> Plugins:
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     the nutch
>>>>>>> core extension points (nutch-extensionpoints)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Basic URL
>>>>>>> Normalizer (urlnormalizer-basic)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Html
>>>>>>> Parse Plug-in (parse-html)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Basic
>>>>>>> Indexing Filter (index-basic)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     HTTP
>>>>>>> Framework (lib-http)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -
>>>>>>> Pass-through URL Normalizer (urlnormalizer-pass)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Regex URL
>>>>>>> Filter (urlfilter-regex)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Http
>>>>>>> Protocol Plug-in (protocol-http)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Regex URL
>>>>>>> Normalizer (urlnormalizer-regex)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Tika
>>>>>>> Parser Plug-in (parse-tika)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     OPIC
>>>>>>> Scoring Plug-in (scoring-opic)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     CyberNeko
>>>>>>> HTML Parser (lib-nekohtml)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Anchor
>>>>>>> Indexing Filter (index-anchor)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Regex URL
>>>>>>> Filter Framework (lib-regex-filter)
>>>>>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Registered
>>>>>>> Extension-Points:
>>>>>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch URL
>>>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
>>>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
>>>>>>> Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
>>>>>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch URL
>>>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
>>>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     HTML
>>>>>>> Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
>>>>>>> Content Parser (org.apache.nutch.parse.Parser)
>>>>>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
>>>>>>> Scoring (org.apache.nutch.scoring.ScoringFilter)
>>>>>>> 2012-05-10 22:26:35,439 INFO  regex.RegexURLNormalizer - can't find
>>>>>>> rules for scope 'inject', using default
>>>>>>> 2012-05-10 22:26:36,434 INFO  crawl.Injector - Injector: Merging
>>>>>>> injected urls into crawl db.
>>>>>>> 2012-05-10 22:26:36,710 WARN  util.NativeCodeLoader - Unable to load
>>>>>>> native-hadoop library for your platform... using builtin-java classes
>>>>>>> where applicable
>>>>>>> 2012-05-10 22:26:37,542 INFO  crawl.Injector - Injector: finished at
>>>>>>> 2012-05-10 22:26:37, elapsed: 00:00:06
>>>>>>> 2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: starting
>>>>>>> at 2012-05-10 22:26:37
>>>>>>> 2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: Selecting
>>>>>>> best-scoring urls due for fetch.
>>>>>>> 2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator:
>>>>>>> filtering: true
>>>>>>> 2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator:
>>>>>>> normalizing: true
>>>>>>> 2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: topN: 100
>>>>>>> 2012-05-10 22:26:37,552 INFO  crawl.Generator - Generator: jobtracker
>>>>>>> is 'local', generating exactly one partition.
>>>>>>> 2012-05-10 22:26:37,820 INFO  crawl.FetchScheduleFactory - Using
>>>>>>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>>>>>> 2012-05-10 22:26:37,820 INFO  crawl.AbstractFetchSchedule -
>>>>>>> defaultInterval=2592000
>>>>>>> 2012-05-10 22:26:37,820 INFO  crawl.AbstractFetchSchedule -
>>>>>>> maxInterval=7776000
>>>>>>> 2012-05-10 22:26:37,856 INFO  regex.RegexURLNormalizer - can't find
>>>>>>> rules for scope 'partition', using default
>>>>>>> ...
>>>>>>> ...
>>>>>>> INFO: [] webapp=/solr path=/update
>>>>>>> params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2}
>>>>>>> status=0 QTime=221
>>>>>>> 2012-05-10 22:36:26,336 INFO  solr.SolrIndexer - SolrIndexer:
>>>>>>> finished at 2012-05-10 22:36:26, elapsed: 00:00:05
>>>>>>> 2012-05-10 22:36:26,339 INFO  solr.SolrDeleteDuplicates -
>>>>>>> SolrDeleteDuplicates: starting at 2012-05-10 22:36:26
>>>>>>> 2012-05-10 22:36:26,339 INFO  solr.SolrDeleteDuplicates -
>>>>>>> SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
>>>>>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>>>>>> INFO: [] webapp=/solr path=/select
>>>>>>> params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220
>>>>>>> status=0 QTime=74
>>>>>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>>>>>> INFO: [] webapp=/solr path=/select
>>>>>>> params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220
>>>>>>> status=0 QTime=0
>>>>>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>>>>>> INFO: [] webapp=/solr path=/select
>>>>>>> params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=220&version=2}
>>>>>>> hits=220 status=0 QTime=9
>>>>>>> 2012-05-10 22:36:27,656 WARN  mapred.LocalJobRunner - job_local_0020
>>>>>>> java.lang.NullPointerException
>>>>>>>
>>>>>>>     at org.apache.hadoop.io.Text.encode(Text.java:388)
>>>>>>>     at org.apache.hadoop.io.Text.set(Text.java:178)
>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>>>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>>>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>>>>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>>>>>
>>>>>>>>> I issued the commands
>>>>>>>>>
>>>>>>>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb
>>>>>>>>> crawldb/linkdb crawldb/segments/*
>>>>>>>>>
>>>>>>>>> separately, after which I got no errors. When I browsed to
>>>>>>>>> http://localhost:8983/solr/admin and attempted a search, I got the
>>>>>>>>> error
>>>>>>>>>
>>>>>>>>>    HTTP ERROR 400
>>>>>>>>>
>>>>>>>>> Problem accessing /solr/select. Reason:
>>>>>>>>>     undefined field text
>>>>>>>>>
>>>>>>>> But this is a Solr thing: you have no field named "text". Resolve
>>>>>>>> it in Solr or ask on the Solr mailing list.
>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> Powered by Jetty://
>>>>>>>>>
>>>>>>>>> What am I doing wrong?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>> Regards,
>>>>>>>
>>>>>>


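PS: for the Solr-side deduplication Markus suggested, a minimal solrconfig.xml sketch (untested here; the field names "signature" and "content" are assumptions, and the signature field must also be declared in schema.xml -- see the Solr wiki page on Deduplication):

```xml
<!-- Update chain that computes a content signature and overwrites dupes.
     TextProfileSignature is the same algorithm Nutch uses for its digests. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <!-- fields that feed the signature; adjust to your schema -->
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

Then point your /update request handler at the chain with <str name="update.chain">dedupe</str> so documents are deduplicated as they are indexed.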
-- 
Jean-François Gingras
