[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887588#comment-13887588 ]
Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

"filters and normalizers": -noFilter is not really an option if sitemaps are used and gzipped documents (e.g. software packages) shall be excluded. In customized crawls URL filter rules are often complex, and I want to avoid having two sets of rules in the end. Sitemaps are different from ordinary documents/URLs (robots.txt is also different): they are not stored in CrawlDb and may require other filter rules. What about an option "-noFilterSitemap"?

"Fetch intervals of 1 second or 1 hour may cause troubles":
> We are blindly accepting user's custom information in inject.
Yes, because the user (crawl administrator) can change the seed list (it's a file/directory on local disk or HDFS). Sitemaps are not necessarily under the user's control. If we (optionally) adjust the fetch interval by (configurable) min/max limits, that would help to catch unreasonable values which would, e.g., re-fetch a bunch of pages every cycle.

"SitemapReducer overwriting": In a continuous crawl we know when pages are modified and have heuristics to estimate the change frequency of a page (AdaptiveFetchSchedule). The question is whether we trust those values obtained from crawling or prefer (possibly bogus) values from sitemaps. Using the sitemap values for new URLs found in sitemaps is less critical.

> (a) score : Crawler commons assigns a default score of 0.5 if there was none
> provided in sitemap.
Needs an upgrade of crawler-commons (0.2 is still used, which sets priority to 0).
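The min/max adjustment of sitemap fetch intervals mentioned above could be sketched as follows. This is only an illustration of the proposed clamping, not actual Nutch code; the class name, the bound values, and any configuration keys they would come from are assumptions.

```java
// Hypothetical sketch: force a sitemap-provided fetch interval into
// configurable [min, max] bounds before it is written to CrawlDb.
// Class name and bound values are illustrative only.
public class SitemapIntervalClamp {

    // Example bounds a crawl administrator might configure.
    static final int MIN_INTERVAL = 3600;     // 1 hour, in seconds
    static final int MAX_INTERVAL = 2592000;  // 30 days, in seconds

    /** Returns the sitemap interval clamped into [min, max] seconds. */
    static int clamp(int sitemapInterval, int min, int max) {
        return Math.max(min, Math.min(max, sitemapInterval));
    }

    public static void main(String[] args) {
        // A sitemap claiming a 1-second interval is raised to the minimum;
        // an absurdly large value is capped at the maximum.
        System.out.println(clamp(1, MIN_INTERVAL, MAX_INTERVAL));
        System.out.println(clamp(99999999, MIN_INTERVAL, MAX_INTERVAL));
    }
}
```

With such a clamp in place, a sitemap advertising changefreq "always" could no longer cause a bunch of pages to be re-fetched every cycle.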
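The merge policy discussed under "SitemapReducer overwriting" could look like this: a URL already in CrawlDb keeps its crawl-derived interval (e.g. from AdaptiveFetchSchedule), while a URL seen only in a sitemap takes the sitemap value. This is a sketch of the policy, not the actual SitemapReducer implementation; the method and its sentinel convention are assumptions.

```java
// Hypothetical sketch of the merge decision: trust intervals derived
// from actual fetch history over possibly bogus sitemap hints, and use
// sitemap values only for URLs that are new to CrawlDb.
public class SitemapMergePolicy {

    /**
     * @param crawledInterval interval (seconds) learned from crawling,
     *                        or -1 if the URL is not yet in CrawlDb
     * @param sitemapInterval interval (seconds) advertised by the sitemap
     * @return the interval to store in CrawlDb
     */
    static int mergeInterval(int crawledInterval, int sitemapInterval) {
        // Known URL: keep the value obtained from crawling.
        // New URL: fall back to the sitemap value.
        return crawledInterval >= 0 ? crawledInterval : sitemapInterval;
    }
}
```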
> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
> I recently came across this rather stagnant codebase [0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here [1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)