[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883523#comment-13883523
 ] 

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Sorry, you're right: the comment "hacky way" applies to trying http and https 
to check which host-URL would pass the filters. That's ok, there is no better 
solution for that.
But what about the decision whether a string passed to filterNormalize() is a 
host from HostDb or a URL from a list of sitemaps? This decision could be made 
without any heuristics: inside map() we know the type (host or sitemap URL) 
from the class of the value:
{code}
boolean isHost = (value instanceof HostDatum);
String url = filterNormalize(key.toString(), isHost);
{code}
The method filterNormalize() could then be simplified, and the member variable 
isHost would become obsolete.
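To illustrate, here is a minimal, self-contained sketch of how the simplified method might look. It is not taken from the patch: the class FilterNormalizeSketch, the stand-in HostDatum, the filterChain field (which returns null for a rejected URL) and the handle() method are all hypothetical names introduced for illustration.

```java
import java.util.function.UnaryOperator;

public class FilterNormalizeSketch {

    /** Stand-in for Nutch's HostDatum; only its type matters here. */
    static class HostDatum {}

    /** Hypothetical normalize-and-filter chain: returns null when a URL is rejected. */
    private final UnaryOperator<String> filterChain;

    FilterNormalizeSketch(UnaryOperator<String> filterChain) {
        this.filterChain = filterChain;
    }

    /** The caller passes the type in, so nothing is guessed from the string itself. */
    String filterNormalize(String key, boolean isHost) {
        if (!isHost) {
            // Sitemap URLs already carry a scheme: filter them directly.
            return filterChain.apply(key);
        }
        // Host names carry no scheme: try both and keep the first one that passes.
        for (String scheme : new String[] {"http://", "https://"}) {
            String url = filterChain.apply(scheme + key);
            if (url != null) {
                return url;
            }
        }
        return null;
    }

    /** Mirrors the decision made in map(): the class of the value tells the type. */
    String handle(String key, Object value) {
        boolean isHost = (value instanceof HostDatum);
        return filterNormalize(key, isHost);
    }
}
```

Since isHost is only a parameter and a local variable here, no instance state is shared between calls.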
Regarding concurrency: the javadoc of 
[[MultithreadedMapper.java|http://hadoop.apache.org/docs/stable/api/src-html/org/apache/hadoop/mapreduce/lib/map/MultithreadedMapper.html]]
 states that "Mapper implementations using this MapRunnable must be 
thread-safe." When in doubt, it may be better to follow this advice rather than 
rely on the (current) implementation. If SitemapParser is thread-safe (at first 
glance, it is), it should be easy to make SitemapMapper thread-safe as well.
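As a rough sketch of what "thread-safe" means for a map() method, the example below keeps the host/sitemap flag in a local variable and writes only to a concurrent collection. It does not use the Hadoop API: ThreadSafeMapSketch, the stand-in HostDatum, the output queue and runConcurrently() are hypothetical names, and the thread pool merely imitates several threads calling map() on one mapper instance.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadSafeMapSketch {

    /** Stand-in for Nutch's HostDatum; only its type matters here. */
    static class HostDatum {}

    /** Thread-safe sink standing in for the job's output context. */
    final ConcurrentLinkedQueue<String> output = new ConcurrentLinkedQueue<>();

    /** Writes no instance field: isHost lives on the calling thread's stack. */
    void map(String key, Object value) {
        boolean isHost = (value instanceof HostDatum);
        output.add((isHost ? "host:" : "sitemap:") + key);
    }

    /** Calls map() on a single shared instance from several threads. */
    static ThreadSafeMapSketch runConcurrently(int records, int threads) {
        ThreadSafeMapSketch mapper = new ThreadSafeMapSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < records; i++) {
            final int n = i;
            // Even records imitate HostDb entries, odd ones sitemap URLs.
            final Object value = (n % 2 == 0) ? new HostDatum() : "sitemap";
            pool.submit(() -> mapper.map("example" + n + ".org", value));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return mapper;
    }
}
```

With a member variable isHost set in one place and read in another, two threads could overwrite each other's value between the write and the read; the local variable avoids that without any locking.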

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
