Re: How to handle failures in nutch?

2012-04-18 Thread nutch.bu...@gmail.com
>> So the question is - how can I recrawl a single URL while running nutch crawl on an existing input.
> Ah, I see. Use the FreeGenerator tool to generate a segment from a plain

Re: How to handle failures in nutch?

2012-04-10 Thread nutch.bu...@gmail.com
> Any other insights on these issues will be appreciated
> Markus Jelsma-2 wrote
>> hi,

Re: How to handle failures in nutch?

2012-04-10 Thread remi tassing
> 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>> Hi
>> There are some scenarios of failure in nutch which I'm not sure how

Re: How to handle failures in nutch?

2012-04-10 Thread nutch.bu...@gmail.com
> how to handle.
>
> 1. I run nutch on a huge amount of URLs and some kind of OOM exception is thrown, or one of those "

Re: How to handle failures in nutch?

2012-04-10 Thread Markus Jelsma
orked and doesn't have the ones that didn't work. How can I handle them without having to recrawl the whole thing? I don't understand. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3898768.html Sent from the N

Re: How to handle failures in nutch?

2012-04-10 Thread nutch.bu...@gmail.com
>> irely.
>> How can I recover from this? Do I have to recrawl all the URLs that were in the segment?
>> If so, how do I mark them for recrawl in crawldb?
> You can just generate new segments or remove all directo

Re: How to handle failures in nutch?

2012-04-10 Thread Markus Jelsma
d some URLs are not parsed successfully. I get an index which has all the URLs that worked and doesn't have the ones that didn't work. How can I handle them without having to recrawl the whole thing? I don't understand. Thanks. -- View this message in context: http://lucene.47

Re: How to handle failures in nutch?

2012-04-10 Thread nutch.bu...@gmail.com
_generate and start fetching again.
>
> 2. I run nutch on a huge amount of URLs and some URLs are not parsed successfully.
> I get an index which has all the URLs that worked and doesn't have the ones that didn't work.
> How can I handle them without having to recrawl the

Re: How to handle failures in nutch?

2012-04-09 Thread Markus Jelsma
rk. How can I handle them without having to recrawl the whole thing? I don't understand. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3898768.html Sent from the Nutch - User mailing list archive at Nabble.com.

How to handle failures in nutch?

2012-04-09 Thread nutch.bu...@gmail.com
n I handle them without having to recrawl the whole thing? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3898768.html Sent from the Nutch - User mailing list archive at Nabble.com.