RE: General question about subdomains

2017-02-09 Thread Markus Jelsma
re may be more list subscribers interested in the set you have, so please share that with the list if you can. Thanks, Markus -Original message- > From:Joseph Naegele > Sent: Thursday 9th February 2017 23:39 > To: user@nutch.apache.org > Subject: RE: General question about s

RE: General question about subdomains

2017-02-09 Thread Joseph Naegele
ry 09, 2017 3:36 AM To: user@nutch.apache.org Subject: RE: General question about subdomains Hello Joseph, My colleague has not yet started to build a model for these crappy pages, but would still like to. We are going to run into this again soon enough so if you have any set of distinct crap

RE: General question about subdomains

2017-02-09 Thread Markus Jelsma
ted in doing. > > I'm still working on putting together a list of "bad" domains. > > Thanks > --- > Joe Naegele > Grier Forensics > > -Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: Friday, January

RE: General question about subdomains

2017-02-08 Thread Joseph Naegele
-- Joe Naegele Grier Forensics -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Friday, January 13, 2017 10:00 AM To: user@nutch.apache.org Subject: RE: General question about subdomains Joseph - thank you very much! This is exactly the crap we are look

RE: General question about subdomains

2017-01-13 Thread Markus Jelsma
, but many have a different 4th octet. Regards, Markus -Original message- > From:Joseph Naegele > Sent: Friday 13th January 2017 15:11 > To: user@nutch.apache.org > Subject: RE: General question about subdomains > > Markus, > > Interestingly enough, we do use

RE: General question about subdomains

2017-01-13 Thread Joseph Naegele
d help. --- Joe Naegele Grier Forensics -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Wednesday, January 11, 2017 9:43 AM To: user@nutch.apache.org Subject: RE: General question about subdomains Hello Joseph, The only feasible method, as i see, is being

RE: General question about subdomains

2017-01-13 Thread Joseph Naegele
ntent). Partitioning and fetching by IP is definitely a step in the right direction. --- Joe Naegele Grier Forensics -Original Message- From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] Sent: Wednesday, January 11, 2017 9:32 AM To: user@nutch.apache.org Subject: Re: General question

RE: General question about subdomains

2017-01-11 Thread Markus Jelsma
Hello Joseph, The only feasible method, as I see it, is being able to detect these kinds of spam sites as well as domain-parking sites, which produce lots of garbage too. Once you detect them, you can choose not to follow outlinks, or mark them in a domain-blacklist urlfilter. We have seen thes
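
To illustrate the domain-blacklist idea mentioned above, here is a minimal sketch in Python. It is not Nutch's urlfilter plugin or its API; the function names (is_blacklisted, filter_url) and the example domains are hypothetical. It only shows the general technique: reject a URL when its host is a blacklisted domain or any subdomain of one.

```python
from urllib.parse import urlsplit


def is_blacklisted(host, blacklist):
    """Return True if host equals a blacklisted domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    labels = host.split(".")
    # Test every label-aligned suffix: a.b.spam.example, b.spam.example, spam.example, example
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels)))


def filter_url(url, blacklist):
    """Return the URL unchanged if it passes, or None if it should be dropped."""
    host = urlsplit(url).hostname
    if host is None or is_blacklisted(host, blacklist):
        return None
    return url


if __name__ == "__main__":
    # Hypothetical blacklist entry, for illustration only.
    bad_domains = {"spam.example"}
    for candidate in ("http://a.b.spam.example/page", "http://good.example/page"):
        print(candidate, "->", filter_url(candidate, bad_domains))
```

Matching on label-aligned suffixes means a single blacklist entry covers an unbounded number of generated subdomains, while an unrelated host such as notspam.example is left alone.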

Re: General question about subdomains

2017-01-11 Thread Julien Nioche
Hi Joe, Do these subdomains point to the same IP address? Did they blacklist your server, i.e. can you connect to these domains from the crawl server using a different tool like curl? It's not a silver bullet, but one way of preventing this is to group by IP or domain (fetcher.queue.mode and partition.url
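
As a rough illustration of why grouping by IP or domain helps, the sketch below assigns URLs to fetch queues under three grouping modes. It is not Nutch code. The mode labels, the function names (queue_key, group_urls), and the naive "last two labels" domain rule are all assumptions made for this example; a real crawler would use a public-suffix list to find the registered domain.

```python
import socket
from collections import defaultdict
from urllib.parse import urlsplit


def queue_key(url, mode="byHost", resolve=socket.gethostbyname):
    """Pick the queue a URL belongs to under the given grouping mode."""
    host = urlsplit(url).hostname or ""
    if mode == "byIP":
        # All hosts resolving to the same address share one queue.
        return resolve(host)
    if mode == "byDomain":
        # Naive assumption: the registered domain is the last two labels.
        return ".".join(host.split(".")[-2:])
    return host  # byHost: every subdomain gets its own queue


def group_urls(urls, mode="byHost", resolve=socket.gethostbyname):
    """Group URLs into queues keyed by host, domain, or IP."""
    queues = defaultdict(list)
    for url in urls:
        queues[queue_key(url, mode, resolve)].append(url)
    return dict(queues)


if __name__ == "__main__":
    urls = [
        "http://a.example.com/1",
        "http://b.example.com/2",
        "http://c.example.com/3",
    ]
    # Three queues when grouped by host, one when grouped by domain.
    print(len(group_urls(urls, "byHost")), len(group_urls(urls, "byDomain")))
```

With per-host queues, a site that generates endless subdomains gets a fresh queue (and a fresh politeness budget) for each one; grouping by domain or IP collapses them into a single queue, so the crawl delay applies to the site as a whole.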