Nutch - Crawler not following next pages in paginated content

2017-01-11 Thread Manav Bagai
While crawling, all the links on page 1 are crawled, but page 2 (pagination) is generated by JavaScript, so the crawler ignores it. Can you please suggest an approach by which the crawler can fetch the data on page 2? Regards, Manav
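One common answer on this list: the default protocol-http plugin does not execute JavaScript, so a browser-backed protocol plugin such as protocol-selenium (shipped with Nutch 1.x) is needed to render the pagination links. A minimal sketch of the nutch-site.xml changes, assuming protocol-selenium is built and a matching WebDriver is installed (verify the exact plugin and property names against your Nutch version):

```xml
<!-- nutch-site.xml: replace protocol-http with protocol-selenium in
     plugin.includes so pages are rendered in a browser before parsing.
     The value below is an illustrative plugin list, not a drop-in default. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>selenium.driver</name>
  <value>firefox</value>
</property>
```

With this in place, the fetched DOM includes the JavaScript-generated pagination links, which the HTML parser can then extract as outlinks.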

RE: General question about subdomains

2017-01-11 Thread Markus Jelsma
Hello Joseph, The only feasible method, as I see it, is being able to detect these kinds of spam sites as well as domain-park sites; they produce lots of garbage as well. Once you detect them, you can choose not to follow outlinks, or mark them in a domain-blacklist urlfilter. We have seen thes
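The domain-blacklist urlfilter mentioned above is, in Nutch 1.x, the urlfilter-domainblacklist plugin, which rejects any URL whose host, domain, or suffix matches an entry in its configuration file. A minimal sketch, assuming the plugin is enabled in plugin.includes and the domain names below are placeholders:

```
# conf/domainblacklist-urlfilter.txt
# One entry per line; a URL is rejected if its host, domain,
# or suffix matches. These entries are hypothetical examples.
spammy-example.com
parked-example.net
```

Filtered URLs are dropped at generate/fetch time, so outlinks into the blacklisted domains never enter the crawl frontier.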

Re: General question about subdomains

2017-01-11 Thread Julien Nioche
Hi Joe, Do these subdomains point to the same IP address? Did they blacklist your server, i.e. can you connect to these domains from the crawl server using a different tool like curl? Not a silver bullet, but a way of preventing this is to group by IP or domain (fetcher.queue.mode and partition.url
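Grouping the fetch queues by IP or domain, as suggested above, is controlled by two nutch-site.xml properties. A minimal sketch (the accepted values are byHost, byDomain, and byIP; check nutch-default.xml in your version for the exact defaults):

```xml
<!-- nutch-site.xml: queue and partition URLs by IP rather than by host,
     so millions of subdomain hosts on one server share a single
     politeness queue instead of each getting their own. -->
<property>
  <name>fetcher.queue.mode</name>
  <value>byIP</value>
</property>
<property>
  <name>partition.url.mode</name>
  <value>byIP</value>
</property>
```

Note that byIP requires a DNS lookup per URL during partitioning, so it trades crawl-setup time for politeness correctness.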

General question about subdomains

2017-01-11 Thread Joseph Naegele
This is more of a general question, not Nutch-specific: Our crawler discovered some URLs pointing to a number of subdomains of a Chinese-owned domain. It then proceeded to discover millions more URLs pointing to other subdomains (hosts) of the same domain. Most of the names appear to be
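A pattern like this (millions of hosts under one registered domain, typical of wildcard DNS) can be detected before blacklisting by counting distinct hosts per domain in the discovered URL set. A minimal illustrative sketch, not part of Nutch itself; the last-two-labels heuristic is an assumption and a real implementation should use the Public Suffix List:

```python
from collections import defaultdict
from urllib.parse import urlparse

def subdomain_counts(urls):
    """Count distinct hosts per registered domain.

    Uses a naive last-two-labels heuristic (wrong for e.g. .co.uk);
    substitute a Public Suffix List lookup in production.
    """
    hosts = defaultdict(set)
    for url in urls:
        host = urlparse(url).hostname
        if not host:
            continue
        labels = host.split(".")
        registered = ".".join(labels[-2:])
        hosts[registered].add(host)
    return {domain: len(h) for domain, h in hosts.items()}

# Hypothetical sample of discovered URLs.
urls = [
    "http://a.example.com/page1",
    "http://b.example.com/page2",
    "http://c.example.com/page3",
    "http://www.normal-site.org/",
]
counts = subdomain_counts(urls)
# Domains spawning an unusually high number of hosts are blacklist candidates.
suspicious = [d for d, n in counts.items() if n >= 3]
print(suspicious)  # -> ['example.com']
```

Domains flagged this way can then be fed into the domain-blacklist urlfilter so their outlinks are no longer followed.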