Re: Re-crawling scenario and HTTP Headers

Thorsten Scherler Mon, 06 Apr 2009 00:59:05 -0700

On Sat, 2009-04-04 at 20:18 +0800, Mingfai wrote:
> hi,
> 
> I think I have a better picture of Droids now and have learnt things beyond
> the Simple Runtime, including the more advanced GaussianRandomDelayTimer and
> SimpleTaskQueueWithHistory. It seems to me the SimpleTaskQueue is not useful
> for most web crawling scenarios, as pages are usually linked to each other,
> and SimpleTaskQueueWithHistory is very useful.


Yeah, I agree:
http://markmail.org/thread/5t2dyozc2d3l2no2

> 
> AFAIK, there is no mechanism that caters for the re-crawling scenario. I
> wonder if anyone has ideas on:
> 
>    - how to determine whether a page/URL has changed?
>       - follow cache and expiry date in the HTTP header

That should be the starting point. Requesting the header is normally
fast and reliable. 
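
To make the header idea a bit more concrete, here is a minimal sketch of
how an expiry time could be derived from Cache-Control / Expires. It is
plain Java and not part of Droids: the class HeaderFreshness, the method
expiresAt and the assumption that header names arrive lower-cased in a
Map are all made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Droids class.
public class HeaderFreshness {
    // Matches "max-age=<seconds>" inside a Cache-Control value.
    private static final Pattern MAX_AGE =
            Pattern.compile("max-age\\s*=\\s*(\\d{1,10})");

    /**
     * Returns the instant after which a re-fetch is due, or empty if the
     * headers carry no expiry information. Header names are assumed to be
     * lower-cased by the caller.
     */
    public static Optional<Instant> expiresAt(Map<String, String> headers,
                                              Instant fetchedAt) {
        // Cache-Control: max-age takes precedence over Expires.
        String cacheControl = headers.get("cache-control");
        if (cacheControl != null) {
            Matcher m = MAX_AGE.matcher(cacheControl.toLowerCase(Locale.ROOT));
            if (m.find()) {
                long seconds = Long.parseLong(m.group(1));
                return Optional.of(fetchedAt.plus(Duration.ofSeconds(seconds)));
            }
        }
        String expires = headers.get("expires");
        if (expires != null) {
            try {
                return Optional.of(ZonedDateTime
                        .parse(expires.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                        .toInstant());
            } catch (DateTimeParseException e) {
                // An unparsable Expires value (e.g. "0") means "already expired".
                return Optional.of(fetchedAt);
            }
        }
        return Optional.empty();
    }
}
```

When neither header is present the sketch returns an empty Optional, so
the caller has to fall back to some other policy (e.g. always re-fetch).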

>       - Size, plus and minus 5-15%

Not sure about that; it seems pretty hacky, since the text could have
changed entirely while the absolute size stays the same.

>       - Text change detection algorithm, such as Myers' diff algorithm (I
>       only know the name :-) and I'm not sure if it is really meaningful to do
>       detection in this way)
>       http://code.google.com/p/google-diff-match-patch/

The problem with this is that you actually have to request the response
body to compare it with the copy on your system. That only makes sense
when the handler stage consumes a lot of time/resources when it is
invoked.

e.g. the helloCrawler requests a page and saves it to disk. If we first
compare the HTTP response body with the saved page, it may be faster to
just save it again.

> 
>    - when to implement the detection logic in Droids?
>       - We could have a Task Validator to check the fetch history and maybe
>       reject the task if the expiry time is not over yet. This is the first
>       level of change detection.

Agree, there should be an expires/changed validator like you describe.
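
Roughly, such a validator could look like the sketch below. It is
deliberately self-contained and does not implement any Droids interface:
ExpiryValidator, recordFetch and validate are hypothetical names, and
the fetch history is just an in-memory map from URL to expiry time.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the Droids TaskValidator API.
public class ExpiryValidator {
    // Fetch history: URL -> instant at which the fetched copy expires.
    private final Map<String, Instant> expiryByUrl = new ConcurrentHashMap<>();

    /** Remember when the copy fetched for this URL expires. */
    public void recordFetch(String url, Instant expiresAt) {
        expiryByUrl.put(url, expiresAt);
    }

    /**
     * Returns true if the task should run: either the URL was never
     * fetched, or its recorded expiry time has been reached.
     */
    public boolean validate(String url, Instant now) {
        Instant expiresAt = expiryByUrl.get(url);
        return expiresAt == null || !now.isBefore(expiresAt);
    }
}
```

A task queue would call validate() before handing out a task and
recordFetch() after each successful fetch.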

>       - At parse time, as the content is first accessed, one could
>       implement a parser that does change detection.

That could serve as second level change detection, but as said above,
in some cases the benefit does not justify the extra work.
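
If one wanted a cheap second level check, one option (my assumption, not
something Droids provides, and a different technique from the diff
approach mentioned above) is to store a digest of the body and compare
it on the next fetch. Unlike a size comparison, this also catches
content that changed while keeping the same length. ContentFingerprint
and its methods are hypothetical names.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper, not a Droids class.
public class ContentFingerprint {

    /** SHA-256 digest of the response body. */
    public static byte[] digest(byte[] body) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(body);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** True if there is no previous digest or the new body hashes differently. */
    public static boolean changed(byte[] previousDigest, byte[] newBody) {
        return previousDigest == null
                || !Arrays.equals(previousDigest, digest(newBody));
    }
}
```

Note that this still requires fetching the body, so it only saves the
work done after the fetch, not the request itself.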

> 
> For both of the above cases, there is a problem that the ContentEntity
> doesn't contain the full set of HTTP headers (at least, the HTTP headers
> that are relevant to change detection). Should all HTTP headers be stored
> in the ContentEntity?

Yes, that makes sense. However, we need a hybrid approach, since we
have both FileContentEntity and HttpContentEntity. I mean, storing ALL
headers only makes sense for HttpContentEntity, right?

salu2

> 
> Regards,
> mingfai
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)
