[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882968#comment-13882968 ]
Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Great, looks good and is really compact while providing a lot of functionality. I've just started to test SitemapProcessor; here are my first comments:
* SitemapProcessor.java has no Apache license header
* it would be nice to see counters in the log output
* regarding Lewis' point #3: doesn't a comment "a hacky way" mean: "try to avoid that"? Why not set isHost inside map(...) by {{isHost = (value instanceof HostDatum)}} and pass it as a parameter to filterNormalize()? This would avoid any errors due to incomplete heuristics, such as this one, seen when testing with sitemaps accessed via the file protocol:
{code}
INFO api.HttpRobotRulesParser - Couldn't get robots.txt for http://file:/tmp/sitemap1.xml/: java.net.UnknownHostException: file
{code}
* concurrency: "returning" the value of isHost from filterNormalize() to map() via a member variable is not thread-safe and will cause problems in combination with MultithreadedMapper. One more argument for passing it from map() to filterNormalize() as a parameter.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch,
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch,
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1]
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
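
For illustration only, the suggestion in the comment above (decide isHost inside map() and hand it to filterNormalize() as a parameter) could look roughly like the sketch below. It is not taken from the patch: the class SitemapMapSketch, the simplified method signatures, the empty stand-in HostDatum, and the body of filterNormalize() are all assumptions made for the example. The point is only that isHost lives in a local variable, so concurrent map() calls cannot overwrite each other's value.

```java
// Stand-in for Nutch's HostDatum; the real class is not reproduced here.
final class HostDatum {}

final class SitemapMapSketch {

    // Hypothetical placeholder for the real filter/normalize step.
    // isHost arrives as a parameter, so the method keeps no shared state.
    static String filterNormalize(String url, boolean isHost) {
        if (url == null || url.isEmpty()) {
            return null;
        }
        // Assumed behavior: only host entries get a scheme and trailing slash.
        return isHost ? "http://" + url + "/" : url;
    }

    // Hypothetical, simplified stand-in for map(...): the type of the value
    // decides isHost, with no heuristic on the key and no member variable.
    static String map(String key, Object value) {
        boolean isHost = (value instanceof HostDatum); // local to this call
        return filterNormalize(key, isHost);
    }
}
```

Under these assumptions, a key such as file:/tmp/sitemap1.xml paired with a non-HostDatum value passes through unchanged instead of being treated as a host name.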