Hi Bayu,

I did not fully understand your question, but I will try to address it as far as I can.

The generate phase creates a segment that contains only the fetch list (this is stored in the "crawl_generate" directory inside the segment). If the crawldb has no URLs that are eligible for fetching at that point, it will end up creating an empty directory. The actual data is populated inside the segment during the fetch and parse phases. ([0] is a shameless plug of my answer on Stack Overflow, which describes the subdirectories inside the segments dir.)

During generate or fetch, Nutch cannot tell whether a page has actually been updated at the content owner's end. It has to re-fetch the corresponding URL.

Does that answer what you wanted?

[0] : http://stackoverflow.com/questions/10225239/what-the-outputs-exactly-are-when-integrating-nutch1-4-and-solr/10262243

Thanks,
Tejas Patil

On Fri, Jan 11, 2013 at 5:35 PM, Bayu Widyasanyata <[email protected]> wrote:

> Hi,
>
> When "nutch generate" is executed the new segments will create and somehow
> they would'nt? It's when "segment already parsed" generated, in example:
>
> ParseSegment: segment: crawl/segments/20130106091814
> Exception in thread "main" java.io.IOException: Segment already parsed!
>
> My question is how the new segments is created or how nutch know that the
> page is updated?
> Does it handle by fetching process which know when a page is updated?
>
> Does my analyzing above is correct?
>
> Now, I do "trick" to force the generating of segments by put adddays
> command of nutch.
>
> Thanks,
>
> --
> wassalam,
> [bayu]
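
As a rough illustration of the segment layout described in the reply above, here is a minimal Python sketch that reports which phases have left output in a segment. It assumes the segment lives on the local filesystem (not HDFS) and uses the Nutch 1.x subdirectory names listed in the linked Stack Overflow answer; the function name `segment_phases` and the grouping of directories by phase are my own for illustration and are not part of Nutch.

```python
import os

# Subdirectories found inside a Nutch 1.x segment, grouped by the phase
# that is expected to write them (assumption: default 1.x layout).
PHASE_DIRS = {
    "generate": ["crawl_generate"],
    "fetch": ["crawl_fetch", "content"],
    "parse": ["crawl_parse", "parse_data", "parse_text"],
}


def segment_phases(segment_dir):
    """Return {phase: bool}, True if any directory of that phase exists."""
    return {
        phase: any(os.path.isdir(os.path.join(segment_dir, d)) for d in dirs)
        for phase, dirs in PHASE_DIRS.items()
    }
```

For example, calling `segment_phases("crawl/segments/20130106091814")` on a segment that already shows parse output would suggest that running the parse step on it again is what leads to the "Segment already parsed!" message, while a segment showing only generate output still needs to be fetched and parsed.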

