Hi Tejas,
Sorry if my questions are confusing :)

I have read your post on StackOverflow, and it cleared some things up for me.

What I still don't understand is how Nutch knows when it should not parse
a segment (as in the "Segment already parsed!" error).
Sometimes I have to run the cycle more than twice before a document (a URL)
and its outlinks are fetched and parsed by Nutch (to get more depth).
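
As a rough illustration of the "Segment already parsed!" case, here is a
minimal Python sketch. It assumes that the error is raised when the segment
directory already contains a crawl_parse subdirectory, and that a fetched
segment contains crawl_fetch; both are my understanding of the segment layout,
not something confirmed in this thread. The helper names are hypothetical and
are not part of Nutch.

```python
from pathlib import Path


def is_already_parsed(segment_dir):
    """Return True if the segment directory already holds parse output.

    Assumption: an existing crawl_parse subdirectory is what makes
    "nutch parse" stop with "Segment already parsed!".
    """
    return (Path(segment_dir) / "crawl_parse").is_dir()


def unparsed_segments(segments_root):
    """Return names of segments that look fetched but not yet parsed."""
    root = Path(segments_root)
    return sorted(
        seg.name
        for seg in root.iterdir()
        if seg.is_dir()
        and (seg / "crawl_fetch").is_dir()  # assumption: fetch output lives here
        and not is_already_parsed(seg)
    )
```

Calling unparsed_segments("crawl/segments") would then list only the segments
that are still safe to pass to the parse step, under the assumptions above.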

Back to my question.
A simple example is the front page of an online newspaper website.
If they add one news item to the front page, will Nutch create a new segment
inside the crawl/segments directory (e.g. in YYYYMMDDHHMMSS format)?

So, if Nutch cannot identify whether a page has actually been updated (in
the example above, the newspaper's front page gaining one news item / one
outlink), then we have to force Nutch to re-fetch the URL. Is that correct?
Or should we run generate with the -adddays option periodically to make sure
the database stays up to date?
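
To make the -adddays question concrete, here is a minimal Python sketch of
the selection rule as I understand it: generate picks a URL when its next
fetch time is not later than "now", and -adddays shifts that "now" forward by
the given number of days. This is a simplification under that assumption, not
Nutch's actual code, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def is_due_for_fetch(next_fetch_time, now, add_days=0):
    """Return True if a URL would be selected into the fetch list.

    Assumption: -adddays N makes generate behave as if the current
    time were N days in the future.
    """
    effective_now = now + timedelta(days=add_days)
    return next_fetch_time <= effective_now


# A page fetched today with a 30-day re-fetch interval (illustrative values).
now = datetime(2013, 1, 12, 13, 9)
next_fetch = now + timedelta(days=30)

print(is_due_for_fetch(next_fetch, now))               # plain generate
print(is_due_for_fetch(next_fetch, now, add_days=31))  # generate -adddays 31
```

If this is right, a page that changes daily is only re-fetched early when
-adddays (or a shorter fetch interval) makes it look due.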

Thanks.

On Sat, Jan 12, 2013 at 1:09 PM, Tejas Patil <[email protected]> wrote:

> Hi Bayu,
>
> I did not understand your question properly but I will try to address your
> questions as far as I can.
>
> Generate phase creates a segment which will just have the fetch list (this
> is inside the "crawl_generate" directory inside segments). If there are no
> urls in the crawldb which are eligible for fetching at that point, then it
> will end up creating an empty directory.
>
> It is during the Fetch and Parse phases that the actual data is populated
> inside the segments. ([0] is a shameless plug of my answer on StackOverflow,
> which describes the subdirectories inside the segments dir). During
> generate or fetch, Nutch cannot identify if a page is actually being
> updated at the content owners' end. It will have to re-fetch the
> corresponding url.
>
> Does that answer what you wanted ?
>
> [0] :
>
> http://stackoverflow.com/questions/10225239/what-the-outputs-exactly-are-when-integrating-nutch1-4-and-solr/10262243
>
> Thanks,
> Tejas Patil
>
> On Fri, Jan 11, 2013 at 5:35 PM, Bayu Widyasanyata
> <[email protected]> wrote:
>
> > Hi,
> >
> > When "nutch generate" is executed, new segments are sometimes created
> > and sometimes not?
> >
> > That is when "Segment already parsed!" is raised, for example:
> >
> > ParseSegment: segment: crawl/segments/20130106091814 Exception in thread
> > "main" java.io.IOException: Segment already parsed!
> >
> > My question is how new segments are created, or how Nutch knows that a
> > page has been updated.
> > Is that handled by the fetching process, which knows when a page is updated?
> >
> > Is my analysis above correct?
> >
> > For now, I use a "trick" to force the generation of segments by passing
> > the adddays option to nutch.
> >
> > Thanks,
> >
> > --
> > wassalam,
> > [bayu]
> >
>



-- 
wassalam,
[bayu]
