Markus Jelsma created NUTCH-2730:
------------------------------------
Summary: SitemapProcessor to treat sitemap URLs as Set instead of List
Key: NUTCH-2730
URL: https://issues.apache.org/jira/browse/NUTCH-2730
Project: Nutch
Issue Type: Improvement
Components: sitemap
Affects Versions: 1.15
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.16
https://archive.epa.gov/robots.txt lists 160k sitemap URLs, which is absurd.
Almost 160k of them are duplicates, and there are no friendly words to describe
that. Our local Nutch chews through this list in 22s, but for some reason the
big job on Hadoop fails, although that job is also working on a lot more than
this one host. Maybe the duplicates are the problem, maybe they are not.
Either way, treating the sitemap URLs as a Set and not a List makes sense, so
that each distinct sitemap is processed only once.
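
To illustrate the idea, here is a minimal sketch of collapsing a sitemap list
into a Set. It is not the actual SitemapProcessor code or the eventual patch:
the class name SitemapUrlDedup, the method dedupe and the example.org URLs are
all made up for this example, and a LinkedHashSet is assumed only so that the
first-seen order of the URLs is kept.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Nutch.
class SitemapUrlDedup {

  // Copy the sitemap URLs listed in robots.txt into a Set.
  // Duplicates collapse to one entry; LinkedHashSet keeps first-seen order.
  static Set<String> dedupe(List<String> sitemapUrls) {
    return new LinkedHashSet<>(sitemapUrls);
  }

  public static void main(String[] args) {
    // Made-up input imitating a robots.txt that repeats the same sitemap.
    List<String> fromRobots = Arrays.asList(
        "https://example.org/sitemap.xml",
        "https://example.org/sitemap.xml",
        "https://example.org/sitemap-news.xml",
        "https://example.org/sitemap.xml");

    Set<String> unique = dedupe(fromRobots);

    // Prints "4 listed, 2 unique"
    System.out.println(fromRobots.size() + " listed, " + unique.size() + " unique");
  }
}
```

With a List, every repeated entry would be fetched and parsed again. With a
Set, the amount of work depends on the number of distinct sitemaps rather than
on the number of lines in robots.txt.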
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)