>>> >>>> So the question is: how can I recrawl a single URL while running
>>> >>>> nutch crawl on an existing input?
>>> >>>
>>> >>> Ah, I see. Use the FreeGenerator tool to generate a segment from a
>>> >>> plain
>> >>>>
>> >>>> Any other insights on these issues will be appreciated
>> >>>>
>> >>>>
>> >>>>
>> >>>> Markus Jelsma-2 wrote
>> >>>>>
>> >>>>> hi,
>> >>>>>
9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@"
> >>>>> <nutch.buddy@> wrote:
> >>>>>> Hi
> >>>>>> There are some scenarios of failure in nutch which I'm not sure
> >>>>>> how to handle.
>>>>>>
>>>>>> 1. I run nutch on a huge amount of urls and some kind of OOM
>>>>>> exception is
>>>>>> thrown, or one of those "
irely.
>>>
>>>> How can I recover from this? Do I have to recrawl all the urls that
>>>> were in
>>>> the segment?
>>>> If so, how do I mark them for recrawl in crawldb?
>>>
>>> You can just generate new segments or remove all directo
2. I run nutch on a huge amount of urls and some urls are not parsed
successfully.
I get an index which has all the urls that worked and doesn't have the
ones that didn't work.
How can I handle them without having to recrawl the whole thing?
I don't understand.
Thanks.
_generate and start fetching again.
>
> 2. I run nutch on a huge amount of urls and some urls are not parsed
> successfully.
> I get an index which has all the urls that worked and doesn't have the
> ones that didn't work.
> How can I handle them without having to recrawl the
How can I handle them without having to recrawl the whole thing?
I don't understand.
Thanks.
--
View this message in context:
http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3898768.html
Sent from the Nutch - User mailing list archive at Nabble.com.