[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070485#comment-16070485
 ] 

Markus Jelsma commented on NUTCH-1465:
--------------------------------------

Hi Lewis!

It appears to be working fine and bug-free now. That is because the input does 
not overwrite the interval and modified times of existing CrawlDb entries, for 
two reasons:
* overwriting them is messy in Nutch
* websites almost always set bad values, such as 100k large websites 
signaling to refetch everything daily
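
To illustrate the rule, here is a minimal sketch in plain Java. It is not the 
code from the patch: the Entry class, its fields and the apply() helper are 
hypothetical stand-ins and not Nutch's own classes. It also assumes that URLs 
not yet in the CrawlDb simply take the values from the sitemap.

```java
import java.util.HashMap;
import java.util.Map;

class SitemapMergeSketch {

    // Hypothetical stand-in for a CrawlDb entry (not Nutch's CrawlDatum).
    static final class Entry {
        final int fetchIntervalSeconds;
        final long modifiedTimeMillis;

        Entry(int fetchIntervalSeconds, long modifiedTimeMillis) {
            this.fetchIntervalSeconds = fetchIntervalSeconds;
            this.modifiedTimeMillis = modifiedTimeMillis;
        }
    }

    // Merge sitemap records into a copy of the existing entries.
    // Known URLs keep their interval and modified time; the sitemap
    // values are only used for URLs that are new (an assumption here).
    static Map<String, Entry> apply(Map<String, Entry> crawlDb,
                                    Map<String, Entry> sitemap) {
        Map<String, Entry> result = new HashMap<>(crawlDb);
        for (Map.Entry<String, Entry> e : sitemap.entrySet()) {
            // The remapping function is only called when the key exists,
            // and it returns the existing value unchanged.
            result.merge(e.getKey(), e.getValue(),
                    (existing, fromSitemap) -> existing);
        }
        return result;
    }
}
```

With this rule, a sitemap that asks for a daily refetch of a URL already in 
the CrawlDb has no effect on that URL's schedule.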

We have it deployed but not activated yet; activation is planned for early 
next week.

The patch is based on the mess in this thread's latest comments and the most 
recent scraps I found on GitHub. It should include the most recent 
contributions you guys added.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.14
>
>         Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, 
> NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, 
> NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
