[
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859286#action_12859286
]
Julien Nioche commented on NUTCH-710:
-------------------------------------
As suggested previously, we could treat canonicals either as redirections or during deduplication. Neither is a satisfactory solution.
Redirection: we want to index the document if/when the target of the canonical
is not available for indexing. We also want to follow the outlinks.
Dedup: we could modify the *DeleteDuplicates code, but canonicals are more complex
due to the fact that we need to follow redirections.
We probably need a third approach: prefilter by going through the crawldb and
detect URLs whose canonical target is already indexed or ready to be
indexed. We need to follow up to X levels of redirection, e.g. doc A is marked as
having canonical representation doc B, doc B redirects to doc C, etc. If the end of the
redirection chain exists and is valid, then mark A as a duplicate of C
(intermediate redirects will not get indexed anyway).
As we don't know whether it has been indexed yet, we would give it a special marker
(e.g. status_duplicate) in the crawlDB. Then:
-> if the indexer comes across such an entry: skip it
-> make it so that *deleteDuplicates can take a list of URLs with status_duplicate
as an additional source of input, OR have a custom resource that deletes such
entries in SOLR or Lucene indices
The implementation would be as follows:
Go through all redirections and generate all redirection chains, e.g.
A -> B
B -> C
D -> C
where C is an indexable document (i.e. it has been fetched and parsed - it may
already have been indexed)
will yield
A -> C
B -> C
D -> C
but also
C -> C
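As a rough sketch, the chain-flattening step could look like the following plain Java (illustrative names only, a depth limit X of 5 and a HashMap of raw redirect pairs are assumptions, and this is not the actual Nutch CrawlDatum/MapReduce API):

```java
import java.util.HashMap;
import java.util.Map;

public class RedirChains {
    /**
     * Flattens raw redirect pairs (source -> target) into a map from every
     * URL to the final end of its chain. Chains longer than maxDepth
     * (including cycles) are dropped. The chain end is also mapped to
     * itself, producing the C -> C entry from the example above.
     */
    static Map<String, String> flatten(Map<String, String> redirs, int maxDepth) {
        Map<String, String> finals = new HashMap<>();
        for (String start : redirs.keySet()) {
            String cur = start;
            int hops = 0;
            while (redirs.containsKey(cur) && hops < maxDepth) {
                cur = redirs.get(cur);
                hops++;
            }
            if (!redirs.containsKey(cur)) { // reached a real end of chain
                finals.put(start, cur);
                finals.put(cur, cur);       // e.g. C -> C
            }
        }
        return finals;
    }

    public static void main(String[] args) {
        Map<String, String> redirs = new HashMap<>();
        redirs.put("A", "B");
        redirs.put("B", "C");
        redirs.put("D", "C");
        // yields A -> C, B -> C, D -> C and also C -> C
        System.out.println(flatten(redirs, 5));
    }
}
```

In a real job this would of course be a MapReduce pass over the crawlDB rather than an in-memory map.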
Once we have all possible redirections: go through the crawlDB in search of
canonicals. If the target of a canonical is the source of a valid alias (e.g. A
- B - C - D), mark it as 'status_duplicate'.
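The second pass could then be sketched as follows (again plain illustrative Java; the method and map names are assumptions, not Nutch code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CanonicalMarker {
    /**
     * Given URL -> canonical-target pairs found in the crawlDB and the
     * precomputed alias map (every URL mapped to the valid end of its
     * redirect chain), returns the URLs to be marked status_duplicate.
     */
    static Set<String> markDuplicates(Map<String, String> canonicals,
                                      Map<String, String> aliases) {
        Set<String> dups = new HashSet<>();
        for (Map.Entry<String, String> e : canonicals.entrySet()) {
            String finalTarget = aliases.get(e.getValue());
            // Mark only when the chain end exists and the page is not
            // its own canonical representation.
            if (finalTarget != null && !finalTarget.equals(e.getKey())) {
                dups.add(e.getKey());
            }
        }
        return dups;
    }
}
```

Entries in the returned set would then be rewritten into the crawlDB with the status_duplicate marker.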
This design implies generating quite a few intermediate structures, plus scanning
the whole crawlDB twice (once for the aliases, then for the canonicals), plus rewriting
the whole crawlDB to mark some of the entries as duplicates.
This would be much easier to do once we have Nutch2/HBase: we could simply follow
the redirects from the initial URL having a canonical tag instead of generating
these intermediate structures. We could then modify the entries one by one
instead of regenerating the whole crawlDB.
WDYT?
> Support for rel="canonical" attribute
> -------------------------------------
>
> Key: NUTCH-710
> URL: https://issues.apache.org/jira/browse/NUTCH-710
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 1.1
> Reporter: Frank McCown
> Priority: Minor
>
> There is a new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of
> URLs crawled and indexed and reduce duplicate page content.