[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-1465:
-----------------------------------

    Attachment: NUTCH-1465-sitemapinjector-trunk-v1.patch

Hi Tejas,

attached you'll find a patch for a sitemap injector. Originally written by Hannes Schwarz, it has been in use by us for quite some time. The patch contains a revised and improved version which, however, still needs some more work (see the TODOs in the code).

The use case is somewhat different from way B: the sitemap injector takes URLs of sitemaps (not via robots.txt) and injects the URLs they list directly into the CrawlDb (no extra sitemapDB - do we really need an extra DB?). Robots.txt is not used as an intermediate step/hop because experience has shown that customers often prepare a special sitemap for the site search crawler which differs from the sitemap propagated in robots.txt.

Btw., NUTCH-1622 would enable solution A: outlinks can now hold extra info.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch,
>                      NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
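
For readers unfamiliar with the sitemap format, the extraction step that a sitemap injector like the one in the comment above depends on can be sketched roughly as follows. This is only an illustration in Python, not code from the attached patch (which is Java and not reproduced here). The function name `parse_sitemap`, the `SAMPLE` document, and the returned dict layout are assumptions made for this sketch; only the XML namespace and element names come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap, used only for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2013-12-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
  </url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; optional fields are None if absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            # <loc> is mandatory in the protocol; skip malformed entries.
            continue
        entries.append({
            "url": loc,
            "lastmod": url.findtext("sm:lastmod", namespaces=SITEMAP_NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=SITEMAP_NS),
            "priority": url.findtext("sm:priority", namespaces=SITEMAP_NS),
        })
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(SAMPLE):
        print(entry)
```

An injector would then write each extracted URL, possibly together with the optional hints, into the CrawlDb; how the patch maps those hints onto CrawlDb fields is not described in the comment and is left open here.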