[ https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954836#comment-13954836 ]
Sebastian Nagel commented on NUTCH-1741:
----------------------------------------

Hi [~alparslan.avci],

the plan for solution B is really detailed. It also looks somewhat complex, not far from the complexity of solution A, which is definitely harder to integrate but would be simpler for the user (sitemaps are automatically detected, no changes to the crawler workflow). But the argument "control to the user" is important regardless.

A few questions on details:
* "takes advantage of standard FetcherJob ..."
-- what about sitemap indexes? They can't be fetched in one turn, and they can't be held in one web table row, because a sitemap index has multiple URLs.
-- do we really need queues and politeness when fetching only sitemaps? There's rarely more than one sitemap per host.
-- "adaptive fetch schedule for sitemaps": that's an interesting idea, it may help in case of forgotten and hopelessly outdated sitemaps. But isn't a sitemap more like robots.txt -- only cached for a short time and re-fetched within short periods, because a fresh sitemap may contain fresh links?
* "SitemapParserJob": that's a combination of parser + updatedb, right?
* "Parses the sitemap document with plugins like XML, RSS, plain text." -- Does that mean these plugins have to be written?

> Support of Sitemaps in Nutch 2.x
> --------------------------------
>
>                 Key: NUTCH-1741
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1741
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator
>            Reporter: Alparslan Avcı
>             Fix For: 2.3
>
>         Attachments: SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for 2.x branch. It is being discussed
> in NUTCH-1465 for trunk.

--
This message was sent by Atlassian JIRA
(v6.2#6252)