Ah, that means: don't use the crawl command; instead do a little shell scripting to run the separate crawl cycle commands yourself. See the Nutch wiki for examples. And don't run solrdedup; search the Solr wiki for deduplication.
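As a rough sketch of what that scripting looks like with the Nutch 1.4 command set (directory layout, variable names, and the DRY_RUN switch are my own assumptions for illustration, not from this thread; the Nutch wiki has the authoritative recipe):

```shell
#!/bin/sh
# One crawl cycle using the individual Nutch commands instead of the
# all-in-one "crawl" command, with the solrdedup step omitted entirely.
DRY_RUN="${DRY_RUN:-1}"      # default: only print the commands; set DRY_RUN= to execute
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB=crawl/crawldb
LINKDB=crawl/linkdb
SEGMENTS=crawl/segments
SOLR_URL=http://localhost:8983/solr/
DEPTH=3
TOPN=100

run() {                      # echo in dry-run mode, execute otherwise
  if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi
}

run "$NUTCH" inject "$CRAWLDB" urls
i=1
while [ "$i" -le "$DEPTH" ]; do
  run "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN"
  SEGMENT=$(ls -d "$SEGMENTS"/* 2>/dev/null | tail -1)   # newest segment
  SEGMENT="${SEGMENT:-$SEGMENTS/SEGMENT}"                # placeholder in dry runs
  run "$NUTCH" fetch "$SEGMENT"
  run "$NUTCH" parse "$SEGMENT"
  run "$NUTCH" updatedb "$CRAWLDB" "$SEGMENT"
  i=$((i + 1))
done
run "$NUTCH" invertlinks "$LINKDB" -dir "$SEGMENTS"
run "$NUTCH" solrindex "$SOLR_URL" "$CRAWLDB" -linkdb "$LINKDB" "$SEGMENTS"/*
# no solrdedup step: deduplication is left to Solr's update chain instead
```

By default the script just prints the command sequence so you can inspect it before pointing it at a real crawl directory.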
cheers

On Fri, 11 May 2012 07:39:36 +0300, Tolga <[hidden email]> wrote:
> Hi,
>
> How exactly do I "omit solrdedup and use Solr's internal deduplication"
> instead? I don't even know what any of that means :D I've just used
> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 100
> to get the error. Do I have to run all the steps?
>
> Regards,
>
> On 05/10/2012 11:38 PM, Markus Jelsma wrote:
>> Thanks.
>>
>> This is a known issue:
>> https://issues.apache.org/jira/browse/NUTCH-1100
>>
>> I have not been able to find the bug, nor do I know how to reproduce it
>> from scratch. If you have a public site with which we can reproduce it,
>> please comment on the Jira ticket. Make sure you use the default config
>> (or close to it), a seed URL, and the exact crawl & dedup steps needed
>> to reproduce.
>>
>> If you find it, we might fix it. In any case we need to replace the
>> dedup command with a more scalable tool, which it currently is not.
>>
>> In the meantime you can omit solrdedup and use Solr's internal
>> deduplication instead; it works similarly and uses the same signature
>> algorithm as Nutch. Please consult the Solr wiki page on deduplication.
>>
>> Good luck
>>
>> On Thu, 10 May 2012 22:54:37 +0300, Tolga <[hidden email]> wrote:
>>> Hi Markus,
>>>
>>> On 05/10/2012 09:42 AM, Markus Jelsma wrote:
>>>> Hi,
>>>>
>>>> On Thu, 10 May 2012 09:10:04 +0300, Tolga <[hidden email]> wrote:
>>>>> Hi,
>>>>>
>>>>> This will sound like a duplicate, but it actually differs from the
>>>>> other one. Please bear with me. Following
>>>>> http://wiki.apache.org/nutch/NutchTutorial, I first issued the command
>>>>>
>>>>> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
>>>>>
>>>>> Then I got the message:
>>>>>
>>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>>>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>>
>>>> Please include the relevant part of the log. This can be a known issue.
>>>
>>> This is an excerpt from hadoop.log:
>>>
>>> 2012-05-10 22:26:30,349 INFO crawl.Crawl - crawl started in: crawl-20120510222629
>>> 2012-05-10 22:26:30,350 INFO crawl.Crawl - rootUrlDir = urls
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - threads = 10
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - depth = 3
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - solrUrl=http://localhost:8983/solr/
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - topN = 100
>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: starting at 2012-05-10 22:26:30
>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: crawlDb: crawl-20120510222629/crawldb
>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: urlDir: urls
>>> 2012-05-10 22:26:30,809 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>>> 2012-05-10 22:26:34,173 INFO plugin.PluginRepository - Plugins: looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Plugins:
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Html Parse Plug-in (parse-html)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Basic Indexing Filter (index-basic)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         HTTP Framework (lib-http)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Extension-Points:
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>>> 2012-05-10 22:26:35,439 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>>> 2012-05-10 22:26:36,434 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>>> 2012-05-10 22:26:36,710 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2012-05-10 22:26:37,542 INFO crawl.Injector - Injector: finished at 2012-05-10 22:26:37, elapsed: 00:00:06
>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: starting at 2012-05-10 22:26:37
>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: filtering: true
>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: normalizing: true
>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: topN: 100
>>> 2012-05-10 22:26:37,552 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
>>> 2012-05-10 22:26:37,820 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>> 2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>> 2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>> 2012-05-10 22:26:37,856 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
>>> ...
>>> ...
>>> INFO: [] webapp=/solr path=/update params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2} status=0 QTime=221
>>> 2012-05-10 22:36:26,336 INFO solr.SolrIndexer - SolrIndexer: finished at 2012-05-10 22:36:26, elapsed: 00:00:05
>>> 2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2012-05-10 22:36:26
>>> 2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=74
>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=0
>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=220&version=2} hits=220 status=0 QTime=9
>>> 2012-05-10 22:36:27,656 WARN mapred.LocalJobRunner - job_local_0020
>>> java.lang.NullPointerException
>>>     at org.apache.hadoop.io.Text.encode(Text.java:388)
>>>     at org.apache.hadoop.io.Text.set(Text.java:178)
>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>>
>>>>> I issued the commands
>>>>>
>>>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>>>>>
>>>>> and
>>>>>
>>>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*
>>>>>
>>>>> separately, after which I got no errors. When I browsed to
>>>>> http://localhost:8983/solr/admin and attempted a search, I got the error
>>>>>
>>>>> HTTP ERROR 400
>>>>>
>>>>> Problem accessing /solr/select. Reason:
>>>>>
>>>>>     undefined field text
>>>>
>>>> But this is a Solr thing: you have no field named text. Resolve this
>>>> in Solr or on the Solr mailing list.
>>>>
>>>>> Powered by Jetty://
>>>>>
>>>>> What am I doing wrong?
>>>>>
>>>>> Regards,
>>>
>>> Regards,

Markus Jelsma - CTO - Openindex
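For reference, the Solr-side deduplication mentioned above is configured in solrconfig.xml as an update processor chain. A minimal sketch, with field names that are assumptions to be matched against your own schema (the chain name and processor classes are from the Solr deduplication documentation; TextProfileSignature is the same fuzzy-signature algorithm Nutch uses):

```xml
<!-- solrconfig.xml: compute a content signature at index time so Solr
     deduplicates documents itself, replacing Nutch's solrdedup step.
     "signature" and "content" are assumed field names; "signature"
     must also be declared as a stored string field in schema.xml. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be attached to your update request handler so it actually runs on indexing; the Solr wiki page on deduplication shows the handler wiring for each Solr version.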
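On the "undefined field text" error: the admin search form queries Solr's default search field, and the schema in use here evidently defines no field named text. A quick workaround is to query an explicit field, e.g. /solr/select?q=content:someterm. Alternatively, point the default at a field the schema does define; a sketch, assuming your Nutch schema.xml has a content field (verify before copying):

```xml
<!-- schema.xml: make queries without an explicit field search "content" -->
<defaultSearchField>content</defaultSearchField>
```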

