[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070485#comment-16070485 ]
Markus Jelsma commented on NUTCH-1465:
--------------------------------------

Hi Lewis!

It appears to be working fine and bug-free now, because the input is not allowed to overwrite the interval and modified times of existing CrawlDb entries. Two reasons for that choice:
* that is messy in Nutch
* websites tend to set bad values, almost always, such as 100k-large websites signaling to refetch everything daily

We have it deployed but not activated; activation is planned for early next week.

The patch is based on the mess in this thread's latest comments and the most recent scraps I found on GitHub. It should include the most recent contributions you guys added.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.14
>
>         Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch,
> NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch,
> NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch,
> NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1]
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)