[jira] [Commented] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

Sebastian Nagel (JIRA) Sun, 30 Mar 2014 14:12:24 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954836#comment-13954836
 ]


Sebastian Nagel commented on NUTCH-1741:
----------------------------------------

Hi [~alparslan.avci], the plan for solution B is really detailed. It also looks 
somewhat complex, not far from the complexity of solution A, which is 
definitely harder to integrate but would be simpler for the user (sitemaps 
are detected automatically, no changes to the crawler workflow). Still, the 
"control to the user" argument is important. A few questions on details:
* "takes advantage of standard FetcherJob ..."
-- what about sitemap indexes? They cannot be fetched in one round, and they 
cannot be held in a single web table row because a sitemap index references 
multiple sitemap URLs.
-- do we really need queues and politeness when fetching only sitemaps? There's 
rarely more than one sitemap per host.
-- "adaptive fetch schedule for sitemaps": that's an interesting idea, it may 
help in case of forgotten and hopelessly outdated sitemaps. But isn't a sitemap 
more like robots.txt, i.e. cached only for a short time and re-fetched at 
short intervals, because a fresh sitemap may contain fresh links?
* "SitemapParserJob": that's a combination of parser + updatedb, right?
* "Parses the sitemap document with plugins like XML, RSS, plain text."
-- Does this mean that these plugins have to be written?
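
To illustrate the sitemap index question, here is a minimal Python sketch. It 
is not Nutch code: the function name classify_sitemap and the example.com URLs 
are made up for illustration. It tells a sitemap index from a plain sitemap by 
the root element. An index yields only further sitemap URLs, so the page URLs 
become known only after another fetch round.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def classify_sitemap(xml_text):
    """Return (kind, locations); kind is 'index' or 'urlset'.

    Hypothetical helper for illustration, not part of Nutch.
    """
    root = ET.fromstring(xml_text)
    if root.tag == SITEMAP_NS + "sitemapindex":
        kind, child = "index", "sitemap"
    elif root.tag == SITEMAP_NS + "urlset":
        kind, child = "urlset", "url"
    else:
        raise ValueError("not a sitemap document: " + root.tag)

    # Collect the <loc> value of every <sitemap> or <url> entry.
    locations = []
    for entry in root.findall(SITEMAP_NS + child):
        loc = entry.find(SITEMAP_NS + "loc")
        if loc is not None and loc.text:
            locations.append(loc.text.strip())
    return kind, locations


# Made-up sample documents.
INDEX_XML = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap-a.xml</loc></sitemap>
  <sitemap><loc>http://example.com/sitemap-b.xml</loc></sitemap>
</sitemapindex>
"""

URLSET_XML = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page1.html</loc></url>
</urlset>
"""

if __name__ == "__main__":
    for doc in (INDEX_XML, URLSET_XML):
        kind, locations = classify_sitemap(doc)
        if kind == "index":
            # These are sitemaps again: they must be fetched in a later round.
            print("index, fetch next:", locations)
        else:
            # These are ordinary page URLs that could be added to the web table.
            print("urlset, page URLs:", locations)
```

With an index, every returned location is itself a sitemap, which is why at 
least two fetch rounds are needed before any page URL is available.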



> Support of Sitemaps in Nutch 2.x
> --------------------------------
>
>                 Key: NUTCH-1741
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1741
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator
>            Reporter: Alparslan Avcı
>             Fix For: 2.3
>
>         Attachments: SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for the 2.x branch. It is being 
> discussed in NUTCH-1465 for trunk. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
