[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886489#comment-13886489
 ] 

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

SitemapReducer overwrites the score, modified time, and fetch interval of 
existing CrawlDb entries with the values from the sitemap. Is this the desired 
behavior? What about forgotten, hopelessly outdated sitemaps? Or bogus values 
(last modification time in the future)?
If a sitemap does not specify one of score, modified time, or fetch interval, 
that value is set to zero. In this case, we should definitely not overwrite 
existing values. Newly added entries should be assigned 
db.fetch.interval.default and a reasonable score, e.g. 0.5 as recommended by 
the sitemap protocol [2] 
(http://www.sitemaps.org/protocol.html#xmlTagDefinitions). But that may 
depend on scoring plugins. Comments?
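
To make the proposal concrete, here is a minimal sketch of one possible merge 
policy. It is not the code from the attached patches and does not use Nutch's 
classes: Entry, merge, and DEFAULT_SCORE are hypothetical names, and the 
default fetch interval is passed in rather than read from 
db.fetch.interval.default. It assumes, as described above, that a value the 
sitemap does not specify arrives as zero, and it treats a last modification 
time in the future as bogus.

```java
final class SitemapMergeSketch {

    /** Hypothetical stand-in for a CrawlDb entry; not Nutch's CrawlDatum. */
    static final class Entry {
        final float score;
        final long modifiedTime;  // milliseconds since epoch, 0 = not specified
        final int fetchInterval;  // seconds, 0 = not specified

        Entry(float score, long modifiedTime, int fetchInterval) {
            this.score = score;
            this.modifiedTime = modifiedTime;
            this.fetchInterval = fetchInterval;
        }
    }

    /** Assumed fallback score for new entries (the sitemap protocol's default priority). */
    static final float DEFAULT_SCORE = 0.5f;

    /**
     * Merges sitemap values into an existing entry (null if the URL is new).
     * A sitemap value is used only if it was specified and looks plausible.
     */
    static Entry merge(Entry existing, Entry sitemap, int defaultInterval, long now) {
        // Assumption: zero means "not specified in the sitemap".
        boolean hasScore = sitemap.score > 0f;
        boolean hasInterval = sitemap.fetchInterval > 0;
        // Reject a last modification time that lies in the future.
        boolean hasModified = sitemap.modifiedTime > 0L && sitemap.modifiedTime <= now;

        if (existing == null) {
            // New entry: fall back to defaults for anything the sitemap omits.
            return new Entry(
                hasScore ? sitemap.score : DEFAULT_SCORE,
                hasModified ? sitemap.modifiedTime : 0L,
                hasInterval ? sitemap.fetchInterval : defaultInterval);
        }
        // Existing entry: never overwrite with an unspecified or bogus value.
        return new Entry(
            hasScore ? sitemap.score : existing.score,
            hasModified ? sitemap.modifiedTime : existing.modifiedTime,
            hasInterval ? sitemap.fetchInterval : existing.fetchInterval);
    }
}
```

This still lets a stale sitemap overwrite values it does specify, and it cannot 
tell an explicit priority of 0.0 from a missing one, so both questions above 
remain open. Whether the score should be touched at all may be better left to 
the scoring plugins.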

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
