Hi All,
We're merrily proceeding down our route of building a country-specific
search engine, and Nutch seems to be working well. However, we're finding
some sites creeping in that aren't from our country. Specifically, we
automatically allow in sites that are hosted within the country, and
we're finding more sites than we'd like that are hosted here but are
actually owned/operated in another country, and thus not relevant. I'd
like to get rid of these if I can.
Is there a viable way to run Nutch 0.7 against only a whitelist of sites,
and a very large whitelist at that (say 500K to a million+ sites, all
in one whitelist)? If not, is it possible in Nutch 0.8? That way I can
just find other ways of adding known-to-be-good sites to the whitelist
over time.
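
To make the whitelist idea concrete, here is a rough sketch of the check I
have in mind. It is plain Python, not Nutch code; the function names and the
one-hostname-per-line file format are assumptions for illustration only. It
holds the whole list in an in-memory set so each lookup is a single hash
probe, which is the property I'd want from whatever Nutch offers.

```python
from urllib.parse import urlsplit


def load_whitelist(lines):
    """Build a set of lowercase hostnames from an iterable of lines.

    Assumed format: one hostname per line; blank lines and lines
    starting with '#' are skipped.
    """
    hosts = set()
    for line in lines:
        host = line.strip().lower()
        if host and not host.startswith("#"):
            hosts.add(host)
    return hosts


def is_whitelisted(url, whitelist):
    """Return True if the URL's host (with or without 'www.') is listed."""
    host = urlsplit(url).hostname  # already lowercased, None if absent
    if host is None:
        return False
    if host in whitelist:
        return True
    # Treat "www.example.com" as equivalent to "example.com".
    return host.startswith("www.") and host[4:] in whitelist
```

Usage would be something like `wl = load_whitelist(open("whitelist.txt"))`
once at startup, then `is_whitelisted(url, wl)` per candidate URL (the file
name is hypothetical).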
(FWIW, we automatically allow our specific country TLD; for
.com/.net/.org we only allow a site if it is physically hosted here, which
we check against an IP list. If other country search engine folks have
comments on a better way to do this, I'd welcome the input.)
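
For anyone comparing notes, the rule described above amounts to roughly the
following. Again this is a plain-Python sketch and not our actual Nutch
setup: `allow_url`, the `resolver` parameter, and the representation of the
IP list as network ranges are all assumptions made for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

GENERIC_TLDS = ("com", "net", "org")


def resolve(host):
    """Resolve a hostname to an IPv4 address string, or None on failure."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def allow_url(url, country_tld, country_networks, resolver=resolve):
    """Apply the two-step rule: country TLD first, then hosting location.

    country_tld      -- e.g. "xx" (placeholder, without the dot)
    country_networks -- iterable of ipaddress network objects (the IP list)
    resolver         -- hostname -> IP string or None (injectable for tests)
    """
    host = urlsplit(url).hostname
    if not host or "." not in host:
        return False
    tld = host.rsplit(".", 1)[1]
    if tld == country_tld:
        return True  # country TLD is always allowed
    if tld not in GENERIC_TLDS:
        return False  # some other country's TLD
    ip = resolver(host)
    if ip is None:
        return False
    addr = ipaddress.ip_address(ip)
    # Allowed only if the address falls inside one of the listed ranges.
    return any(addr in net for net in country_networks)
```

The second branch is exactly where the unwanted sites get in: a foreign-owned
.com that happens to be hosted on a local IP passes the check.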
- Searching only a whitelist (country specific SE) Insurance Squared Inc.