[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882968#comment-13882968
 ] 

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Great, looks good and is really compact while providing a lot of functionality. 
I've just started to test SitemapProcessor, here are my first comments:
* SitemapProcessor.java has no Apache license header
* would be nice to see counters in log output
* regarding Lewis' point #3: doesn't a comment saying "a hacky way" mean "try to 
avoid that"? Why not set isHost inside map(...) by {{isHost = (value instanceof 
HostDatum)}} and pass it as a parameter to filterNormalize()? This would avoid 
any errors due to incomplete heuristics, such as this one seen when testing with 
sitemaps accessed via the file protocol:
{code}
INFO  api.HttpRobotRulesParser - Couldn't get robots.txt for 
http://file:/tmp/sitemap1.xml/: java.net.UnknownHostException: file
{code}
* concurrency: "returning" the value of isHost from filterNormalize() to map() 
via a member variable is not thread-safe and will cause problems in combination 
with MultithreadedMapper. One more reason to pass it from map() to 
filterNormalize() as a parameter.
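
To illustrate the last two points, here is a rough, self-contained sketch of what I have in mind. It is not the code from the patch: the classes HostDatum and UrlDatum below are empty stand-ins, the method signatures are simplified, and the way a host entry is turned into a URL (prepending "http://" and appending "/") is only my assumption. The point is merely that isHost is derived from the value's type inside map() and travels as an argument, so no member variable is shared between threads and no heuristic has to guess from the string.

```java
import java.util.Locale;

/** Hypothetical sketch; the names and signatures are not those of the patch. */
final class IsHostSketch {

    /** Empty stand-in for a host-level record (not the real Nutch class). */
    static final class HostDatum { }

    /** Empty stand-in for a URL-level record (hypothetical). */
    static final class UrlDatum { }

    /**
     * Stand-in for map(...): derives isHost from the type of the value and
     * hands it down as an argument instead of storing it in a field.
     */
    static String map(String key, Object value) {
        boolean isHost = (value instanceof HostDatum);
        return filterNormalize(key, isHost);
    }

    /**
     * Stand-in for filterNormalize(...). All state arrives via parameters,
     * so concurrent calls from several mapper threads cannot interfere.
     */
    static String filterNormalize(String u, boolean isHost) {
        if (u == null || u.trim().isEmpty()) {
            return null;
        }
        String s = u.trim();
        // Assumption: only host entries get a scheme and trailing slash added.
        // Full URLs such as file:/... are left untouched.
        if (isHost) {
            s = "http://" + s.toLowerCase(Locale.ROOT) + "/";
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(map("Example.org", new HostDatum()));
        System.out.println(map("file:/tmp/sitemap1.xml", new UrlDatum()));
    }
}
```

With this shape, a sitemap given as file:/tmp/sitemap1.xml is never treated as a host, because the decision depends only on the type of the value handed to map().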


> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
