[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tejas Patil updated NUTCH-1465:
-------------------------------
    Attachment: NUTCH-1465-trunk.v2.patch

Attaching NUTCH-1465-trunk.v2.patch, which implements *option (B)*: _Have separate job for the sitemap stuff and merge its output into the crawldb_

+I have tied both the cases in this patch:+
1. Users with a targeted crawl who want to get sitemaps injected from a list of sitemap urls - the use case which [~wastl-nagel] had pointed out.
2. Large open web crawls where users cannot afford to generate sitemap seeds for all the hosts and want nutch to inject sitemaps automatically.

+To try out this patch:+
1. Apply the patch for the HostDb feature (https://issues.apache.org/jira/secure/attachment/12624178/NUTCH-1325-trunk-v4.patch)
2. Apply this patch (NUTCH-1465-trunk.v2.patch)
3. (optional) Add this to conf/log4j.properties at line 11:
{noformat}
log4j.logger.org.apache.nutch.util.SitemapProcessor=INFO,cmdstdout
{noformat}
4. Run using
{noformat}
bin/nutch org.apache.nutch.util.SitemapProcessor
{noformat}

I have started working on a *wiki page* describing this feature: https://wiki.apache.org/nutch/SitemapFeature

Any suggestions and comments are welcome.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch,
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1]
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)