While crawling, all the links on page 1 are fetched, but page 2
(pagination) is rendered via JavaScript, so the crawler ignores that
page.
Could you please suggest an approach by which the crawler can also crawl
the data on page 2?
Regards,
Manav
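A common workaround when pagination is JavaScript-driven is to skip the rendered page entirely and fetch the underlying HTTP endpoint that the script calls (found via the browser's network tab), or to render pages with a headless browser (Nutch ships a protocol-selenium plugin for this in recent 1.x releases, if I recall correctly). A minimal sketch of the first approach, where the endpoint path and the "page"/"size" parameters are hypothetical placeholders:

```python
# Sketch: build the URLs the JavaScript pagination would request, so they
# can be fetched directly (or fed to Nutch as a seed list). The endpoint
# and query parameter names below are assumptions -- inspect the real
# site's XHR calls to find the actual ones.
from urllib.parse import urlencode

def paginated_url(base, page, page_size=20):
    """Return the URL for one page of the hypothetical pagination API."""
    query = urlencode({"page": page, "size": page_size})
    return f"{base}?{query}"

# URLs for pages 2..4 of an assumed JSON endpoint
urls = [paginated_url("https://example.com/api/items", p) for p in range(2, 5)]
```

Once the endpoint is known, the JSON responses can be parsed for item links without any JavaScript execution at all.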
Hello Joseph,
The only feasible method, as I see it, is to detect these kinds of spam
sites, as well as domain-park sites, which also produce a lot of garbage. Once
you detect them, you can choose not to follow their outlinks, or mark them in a
domain-blacklist urlfilter.
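One concrete way to enforce such a blacklist, assuming the regex URL filter plugin (urlfilter-regex) is enabled in plugin.includes, is to add reject rules to regex-urlfilter.txt ahead of the final accept rule. The domain names here are placeholders for whatever spam/parked domains the detection step produces:

```
# regex-urlfilter.txt -- reject everything on known spam / parked domains
# (domains below are placeholders)
-^https?://([a-z0-9-]+\.)*spam-example\.com/
-^https?://([a-z0-9-]+\.)*parked-example\.net/

# accept anything else
+.
```

Rules are applied top to bottom, so the reject lines must appear before the catch-all accept.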
We have seen these
Hi Joe,
Do these subdomains point to the same IP address? Did they blacklist your
server, i.e. can you still connect to these domains from the crawl server
using a different tool such as curl?
Not a silver bullet, but one way of mitigating this is to group the fetch
queues by IP or domain (fetcher.queue.mode and partition.url
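As a sketch, the grouping would be configured in nutch-site.xml roughly as below. The property names and values (byHost/byDomain/byIP) are as I recall them from nutch-default.xml; check your version's defaults before copying:

```xml
<!-- nutch-site.xml: queue and partition fetches by IP so that thousands
     of wildcard subdomains resolving to one server end up in a single,
     politeness-limited queue instead of flooding the fetcher -->
<property>
  <name>fetcher.queue.mode</name>
  <value>byIP</value>
</property>
<property>
  <name>partition.url.mode</name>
  <value>byIP</value>
</property>
```

Note that byIP adds a DNS lookup per host during partitioning, so byDomain can be the cheaper compromise when the spam hosts share a registered domain.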
This is more of a general question, not Nutch-specific:
Our crawler discovered some URLs pointing to a number of subdomains of a
Chinese-owned domain. It then proceeded to discover millions more URLs
pointing to other subdomains (hosts) of the same domain. Most of the names
appear to be
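This wildcard-DNS pattern (one registered domain, millions of generated hosts) can often be caught by counting distinct hosts per registered domain and flagging outliers. A minimal sketch; the two-label "registered domain" heuristic is naive and a real crawl should use the Public Suffix List, and the threshold is an arbitrary placeholder:

```python
# Sketch: flag registered domains that account for an implausible number
# of distinct hosts, as described above. Heuristics and threshold are
# illustrative assumptions, not a production spam detector.
from collections import Counter
from urllib.parse import urlsplit

def registered_domain(url):
    """Naive registered domain: last two DNS labels of the host."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def suspicious_domains(urls, max_hosts=1000):
    """Return registered domains with more than max_hosts distinct hosts."""
    hosts_per_domain = Counter()
    seen_hosts = set()
    for url in urls:
        host = urlsplit(url).hostname
        if host and host not in seen_hosts:
            seen_hosts.add(host)
            hosts_per_domain[registered_domain(url)] += 1
    return {d for d, n in hosts_per_domain.items() if n > max_hosts}
```

Domains flagged this way could then feed the blacklist urlfilter discussed earlier in the thread.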