On Mon, 2005-03-14 at 20:48 +0100, Dawid Weiss wrote:
> I think what they do at Google is a fancy heuristic -- as David Spencer 
> mentioned, suburls of a given page, identical snippets, or titles... My 
> idea was more towards providing a 'realistic overview' of subjects in 
> pages. So you could pick, say, the first document from each cluster and 
> show them like that to the user. Then, in every cluster documents 
> already have mutual similarity (this could be calculated manually, the 
> clustering algorithm doesn't do it for all pairs of documents), but some 
> have more and some have less. You could then hide nearly identical 
> results from the user.
> 
> Anyway, I think the Google method is just a heuristic based on URLs and 
> nothing nearly as fancy.

At the moment I need something quite simple: a way to identify a page that
appears in many forms, e.g.:

- Normal version
- Split across several pages
- Print version
- From a different section (different styling and navigation elements)

Basically identical content, presented in different ways.

I'm beginning to think any on-the-fly filtering would have to be quite
simple because of the required response time. Something like looking at
URLs and titles (which are already separate fields in the index).
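
Roughly something like this is what I have in mind -- just a sketch,
assuming "url" and "title" values pulled out of each hit, and the
normalization rules here are guesses too, not anything Lucene provides:

import java.util.HashSet;
import java.util.Set;

/**
 * Query-time near-duplicate filter: collapses hits whose normalized
 * URL and title have already been seen. The field names and the
 * normalization rules are placeholders for illustration only.
 */
public class SimpleDuplicateFilter {
    private final Set<String> seenKeys = new HashSet<String>();

    /** Returns true if this hit should be shown to the user. */
    public boolean accept(String url, String title) {
        String key = normalizeUrl(url) + "|" + normalizeTitle(title);
        return seenKeys.add(key); // add() is false if already present
    }

    /** Strip query strings, trailing slashes, and "print" markers. */
    private String normalizeUrl(String url) {
        String u = url.toLowerCase();
        int q = u.indexOf('?');
        if (q >= 0) u = u.substring(0, q);
        return u.replaceAll("/print(able)?/", "/").replaceAll("/$", "");
    }

    /** Collapse whitespace and case so restyled titles still match. */
    private String normalizeTitle(String title) {
        return title.toLowerCase().replaceAll("\\s+", " ").trim();
    }
}

The caller would just walk the hits in rank order and drop anything
accept() rejects, so only the first (highest-ranked) version of a page
survives.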

A more permanent solution would have to process the pages ahead of time,
breaking each one down into chunks and then building up a list of hashes
of those chunks. Each page would then have a 'fingerprint', and hopefully
you could come up with a quick way to compare fingerprints at query time.
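
As a sketch of what I mean: chunk on fixed word windows, hash each
chunk, and compare the hash sets with Jaccard overlap. The chunk size
and any similarity threshold would need tuning, and none of this is
Lucene API -- it would run in a separate preprocessing step:

import java.util.HashSet;
import java.util.Set;

/**
 * Offline fingerprinting sketch: split a page's text into fixed-size
 * word chunks, hash each chunk, and keep the set of hashes as the
 * page's fingerprint. Fingerprints are compared by Jaccard overlap:
 * |A intersect B| / |A union B|.
 */
public class PageFingerprint {

    public static Set<Integer> fingerprint(String text, int wordsPerChunk) {
        String[] words = text.toLowerCase().split("\\s+");
        Set<Integer> hashes = new HashSet<Integer>();
        for (int i = 0; i + wordsPerChunk <= words.length; i += wordsPerChunk) {
            StringBuilder chunk = new StringBuilder();
            for (int j = i; j < i + wordsPerChunk; j++) {
                chunk.append(words[j]).append(' ');
            }
            hashes.add(chunk.toString().hashCode());
        }
        return hashes;
    }

    /** Fraction of shared chunks; 1.0 means identical fingerprints. */
    public static double similarity(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<Integer> intersection = new HashSet<Integer>(a);
        intersection.retainAll(b);
        Set<Integer> union = new HashSet<Integer>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}

Pages scoring above some threshold (say 0.9, though that's a guess)
would count as the same document, and since each fingerprint is just a
small set of ints it could be stored alongside the index and compared
cheaply at query time.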


-- 
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.

