On Sat, 2009-04-04 at 20:18 +0800, Mingfai wrote:
> hi,
>
> I think I got a better picture of Droids now and have learnt things beyond
> the Simple Runtime including the more advanced GaussianRandomDelayTimer and
> SimpleTaskQueueWithHistory. It seems to me the SimpleTaskQueue is not useful
> for most web crawling scenario as pages are usually linked to each others,
> and SimpleTaskQueueWithHistory is very useful.

Yeah, I agree: http://markmail.org/thread/5t2dyozc2d3l2no2

>
> AFAIK, there is no mechanism that cater the re-crawling scenario. I wonder
> if anyone has idea on:
>
>    - how to determine a page/URL is changed?
>       - follow cache and expiry date in the HTTP header

That should be the starting point. Requesting the headers is normally fast and
reliable.

>       - Size, plus and minus 5-15%

I am not sure about that. It seems pretty hacky: the text could have changed
entirely while the total size stays the same.

>       - Text change detection algorithm, such as Myer's diff algorithm (i
>       only know the name :-) and i'm not sure if it is really meaningful to do
>       detection in this way)
>       http://code.google.com/p/google-diff-match-patch/

The problem with this is that you actually have to request the response body
to compare it with the page on your system. That only makes sense when
invoking the handler stage consumes a lot of time/resources. E.g. the
helloCrawler requests a page and saves it to disk. If we now compare the HTTP
response body with the saved page, it may be faster to just save it again.

>
>    - when to implement the detection logic in Droids?
>       - We could have a Task Validator to check the fetch history and maybe
>       reject the task if the expiry time is not over yet. This is the
>       first level
>       of change detection.

Agreed, there should be an expires/changed validator like you describe.

>       - At the parse time, as the content is first accessed, one could
>       implement a parser that do change detection.

That could serve as second-level change detection but, as said above, in some
cases the benefit does not justify the extra work.

>
> For both of the above case, there is a problem that the ContentEntity
> doesn't contain the full set of HTTP Header. (at least, HTTP headers that
> are relevant to change detection) Should all HTTP Headers be stored in the
> ContentEntity?

Yes, that makes sense.

However, we need to implement it as a hybrid, since we have FileContentEntity
and HttpContentEntity. I mean, storing ALL headers only makes sense for
HttpContentEntity, right?

salu2

>
> Regards,
> mingfai

--
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad de la Información, S.A.U. (SADESI)
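
PS: To make the first-level check from the thread a bit more concrete, here is a minimal sketch of what an expiry test on stored headers could look like. It is not Droids code: the class name ExpiryCheck, the method needsRefetch, and the idea of keeping headers in a Map with lower-cased names are all assumptions made for illustration. It also simplifies HTTP freshness by measuring max-age from our own fetch time and ignoring the Date and Age headers.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the Droids API. */
final class ExpiryCheck {

    /**
     * Decides whether a previously fetched page should be requested again.
     *
     * @param headers   response headers stored at fetch time, names lower-cased (assumption)
     * @param fetchedAt when the stored copy was fetched
     * @param now       the current time
     * @return true if the stored copy is no longer fresh, or freshness is unknown
     */
    static boolean needsRefetch(Map<String, String> headers, Instant fetchedAt, Instant now) {
        // Cache-Control: max-age takes precedence over Expires (RFC 7234, section 5.3).
        String cacheControl = headers.get("cache-control");
        if (cacheControl != null) {
            for (String directive : cacheControl.split(",")) {
                String d = directive.trim().toLowerCase(Locale.ROOT);
                if (d.startsWith("max-age=")) {
                    try {
                        long seconds = Long.parseLong(d.substring("max-age=".length()).trim());
                        // Simplification: freshness is counted from our own fetch time.
                        return !now.isBefore(fetchedAt.plusSeconds(seconds));
                    } catch (NumberFormatException e) {
                        break; // malformed value: fall through to Expires
                    }
                }
            }
        }

        String expires = headers.get("expires");
        if (expires != null) {
            try {
                Instant expiry = ZonedDateTime
                        .parse(expires, DateTimeFormatter.RFC_1123_DATE_TIME)
                        .toInstant();
                return !now.isBefore(expiry);
            } catch (DateTimeParseException e) {
                return true; // an unparsable Expires value is treated as already expired
            }
        }

        // No freshness information at all: be conservative and fetch again.
        return true;
    }
}
```

A task validator along the lines Mingfai describes could call something like this with the headers kept from the last fetch and reject the task while it returns false. That is also the reason the headers would need to be available on the HTTP flavour of the content entity.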
