Hi Tejas,

Sorry if my questions were confusing :) I have read your answer on Stack Overflow, and it cleared some things up for me.
What I still don't understand is how Nutch knows when not to parse a segment (as indicated by the "Segment already parsed!" error). Sometimes I have to run the cycle more than twice before a document (a URL) and its outlinks are fetched and parsed by Nutch (to reach more depth).

Back to my question. A simple example is the front page of an online newspaper. If they add one news item to the front page, will Nutch create a new segment inside the crawl/segments directory (e.g. in YYYYMMDDHHMMSS format)?

So, if Nutch cannot identify whether a page has actually been updated (in the example above, the front page gaining one news item / one outlink), should we force Nutch to re-fetch the URL? Is that correct? Or should we use the -adddays option periodically to make sure the database is up to date?

Thanks.

On Sat, Jan 12, 2013 at 1:09 PM, Tejas Patil <[email protected]> wrote:

> Hi Bayu,
>
> I did not understand your question properly, but I will try to address
> it as far as I can.
>
> The Generate phase creates a segment which will just have the fetch list
> (this is the "crawl_generate" directory inside the segment). If there are
> no URLs in the crawldb which are eligible for fetching at that point, it
> will end up creating an empty directory.
>
> It is during the Fetch and Parse phases that the actual data is populated
> inside the segments. ([0] is a shameless plug of my answer on Stack
> Overflow, which describes the subdirectories inside the segments dir.)
> During generate or fetch, Nutch cannot identify whether a page has
> actually been updated at the content owner's end. It will have to
> re-fetch the corresponding URL.
>
> Does that answer what you wanted?
>
> [0]:
> http://stackoverflow.com/questions/10225239/what-the-outputs-exactly-are-when-integrating-nutch1-4-and-solr/10262243
>
> Thanks,
> Tejas Patil
>
> On Fri, Jan 11, 2013 at 5:35 PM, Bayu Widyasanyata
> <[email protected]> wrote:
>
> > Hi,
> >
> > When "nutch generate" is executed, sometimes new segments are created
> > and sometimes they are not?
> > That is when "Segment already parsed" is raised, for example:
> >
> > ParseSegment: segment: crawl/segments/20130106091814
> > Exception in thread "main" java.io.IOException: Segment already parsed!
> >
> > My question is: how are new segments created, or how does Nutch know
> > that a page has been updated?
> > Is that handled by the fetching process, which knows when a page is
> > updated?
> >
> > Is my analysis above correct?
> >
> > For now, I use a "trick" to force the generation of segments by passing
> > the adddays option to nutch.
> >
> > Thanks,
> >
> > --
> > wassalam,
> > [bayu]
>

--
wassalam,
[bayu]
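
The "Segment already parsed!" check discussed in this thread can be sketched roughly as below. This rests on an assumption the thread does not state: that Nutch 1.x considers a segment parsed once it contains a "crawl_parse" subdirectory. The helper names is_parsed and unparsed_segments are hypothetical and are not part of Nutch; the sketch only inspects a directory layout shaped like crawl/segments.

```python
import os
import tempfile

# Assumption (not stated in the thread): a Nutch 1.x segment counts as
# parsed once it contains a "crawl_parse" subdirectory.
PARSE_MARKER = "crawl_parse"


def is_parsed(segment_dir):
    """Return True if the segment directory already holds parse output."""
    return os.path.isdir(os.path.join(segment_dir, PARSE_MARKER))


def unparsed_segments(segments_dir):
    """Return paths of segments without parse output, oldest first.

    Segment names are timestamps (e.g. 20130106091814), so sorting the
    names as strings also sorts them by creation time.
    """
    names = sorted(
        name
        for name in os.listdir(segments_dir)
        if os.path.isdir(os.path.join(segments_dir, name))
    )
    paths = [os.path.join(segments_dir, name) for name in names]
    return [path for path in paths if not is_parsed(path)]


if __name__ == "__main__":
    # Build a throwaway layout that mimics crawl/segments.
    with tempfile.TemporaryDirectory() as root:
        old = os.path.join(root, "20130106091814")
        new = os.path.join(root, "20130112130900")
        os.makedirs(os.path.join(old, "crawl_generate"))
        os.makedirs(os.path.join(old, "crawl_parse"))   # already parsed
        os.makedirs(os.path.join(new, "crawl_generate"))  # fetch list only
        for path in unparsed_segments(root):
            print("still needs parsing:", os.path.basename(path))
```

Under that assumption, running the parse step on the older segment again is what triggers the error, while a change on the newspaper front page only shows up after a new generate and fetch cycle (for instance with -adddays) has produced a fresh segment.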

