Re: How segments is created?

Tejas Patil Fri, 11 Jan 2013 22:10:29 -0800

Hi Bayu,

I am not sure I fully understood your question, but I will try to answer
it as far as I can.

The generate phase creates a segment which contains only the fetch list
(stored in the "crawl_generate" directory inside the segment). If no urls
in the crawldb are eligible for fetching at that point, it ends up
creating an empty directory.
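
To make "eligible for fetching" a bit more concrete, here is a rough Python
sketch of the idea. It is a simplified model, not Nutch code: the function
names and the dict standing in for the crawldb are hypothetical, and it
ignores everything the real fetch schedule takes into account (status,
intervals, scoring, topN). The add_days argument mimics what the adddays
option does, which is to move the reference time forward so that urls
scheduled for later already count as due.

```python
from datetime import datetime, timedelta


def eligible_for_fetch(fetch_time, now, add_days=0):
    """Simplified assumption: a url is due once its scheduled fetch time
    is at or before the reference time (now shifted by add_days)."""
    reference = now + timedelta(days=add_days)
    return fetch_time <= reference


def build_fetch_list(crawldb, now, add_days=0):
    """crawldb is a hypothetical dict of url -> next scheduled fetch time.
    Returns the urls that would go into the segment's fetch list."""
    return [
        url
        for url, fetch_time in sorted(crawldb.items())
        if eligible_for_fetch(fetch_time, now, add_days)
    ]
```

With nothing due, build_fetch_list returns an empty list, which corresponds
to the case where generate has nothing to put into the new segment.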

The actual data is populated inside the segment during the fetch and parse
phases. ([0] is a shameless plug of my answer on Stack Overflow, which
describes the subdirectories inside the segments dir.) During generate or
fetch, Nutch cannot tell whether a page has actually been updated at the
content owner's end. It has to re-fetch the corresponding url to find
out.
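
If it helps, the phases can be told apart by looking at which
subdirectories a segment has. The sketch below is only an illustration
under assumptions: it uses the Nutch 1.x subdirectory names
("crawl_generate", "crawl_fetch", "crawl_parse") on a local filesystem,
and segment_state is a hypothetical helper, not part of Nutch. I am also
assuming that the "Segment already parsed!" error is raised when the parse
output already exists in the segment.

```python
import os


def segment_state(segment_dir):
    """Guess how far a segment has progressed from its subdirectories.

    Assumes Nutch 1.x directory names; returns one of
    "parsed", "fetched", "generated" or "empty".
    """
    present = {
        name
        for name in os.listdir(segment_dir)
        if os.path.isdir(os.path.join(segment_dir, name))
    }
    if "crawl_parse" in present:
        return "parsed"      # parsing it again would likely be refused
    if "crawl_fetch" in present:
        return "fetched"     # fetched, not yet parsed
    if "crawl_generate" in present:
        return "generated"   # only the fetch list exists
    return "empty"           # generate found nothing to fetch
```

For example, calling segment_state("crawl/segments/20130106091814") on a
segment that reports "parsed" would explain the exception you are seeing.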

Does that answer your question?

[0] :
http://stackoverflow.com/questions/10225239/what-the-outputs-exactly-are-when-integrating-nutch1-4-and-solr/10262243

Thanks,
Tejas Patil

On Fri, Jan 11, 2013 at 5:35 PM, Bayu Widyasanyata
<[email protected]> wrote:

> Hi,
>
> When "nutch generate" is executed, new segments are sometimes created and
> sometimes they are not.
>
> It happens when the "Segment already parsed" error is raised, for example:
>
> ParseSegment: segment: crawl/segments/20130106091814 Exception in thread
> "main" java.io.IOException: Segment already parsed!
>
> My question is how new segments are created, or how nutch knows that a
> page has been updated.
> Is that handled by the fetching process, which knows when a page is
> updated?
>
> Is my analysis above correct?
>
> For now, I use a "trick" to force the generation of segments by passing
> the adddays option to nutch.
>
> Thanks,
>
> --
> wassalam,
> [bayu]
>
