re may be more list subscribers interested in the set
you have, so please share that with the list if you can.
Thanks,
Markus
-Original message-
> From:Joseph Naegele
> Sent: Thursday 9th February 2017 23:39
> To: user@nutch.apache.org
> Subject: RE: General question about s
ry 09, 2017 3:36 AM
To: user@nutch.apache.org
Subject: RE: General question about subdomains
Hello Joseph,
My colleague has not yet started to build a model for these crappy pages, but
would still like to. We are going to run into this again soon enough so if you
have any set of distinct crap
ted in doing.
>
> I'm still working on putting together a list of "bad" domains.
>
> Thanks
> ---
> Joe Naegele
> Grier Forensics
>
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Friday, January
--
Joe Naegele
Grier Forensics
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Friday, January 13, 2017 10:00 AM
To: user@nutch.apache.org
Subject: RE: General question about subdomains
Joseph - thank you very much!
This is exactly the crap we are look
, but many
have a different 4th octet.
Regards,
Markus
-Original message-
> From:Joseph Naegele
> Sent: Friday 13th January 2017 15:11
> To: user@nutch.apache.org
> Subject: RE: General question about subdomains
>
> Markus,
>
> Interestingly enough, we do use
d help.
---
Joe Naegele
Grier Forensics
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, January 11, 2017 9:43 AM
To: user@nutch.apache.org
Subject: RE: General question about subdomains
Hello Joseph,
The only feasible method, as I see, is being
ntent).
Partitioning and fetching by IP is definitely a step in the right direction.
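
To illustrate what grouping by IP means here, below is a minimal Python sketch. It is not Nutch code and not how Nutch implements partitioning. The function name `group_urls_by_ip`, the `host_to_ip` lookup table, and the `prefix_len` parameter are all hypothetical, and the sketch assumes IPv4 addresses that have already been resolved elsewhere.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import ipaddress


def group_urls_by_ip(urls, host_to_ip, prefix_len=32):
    """Group URLs by the IP (or IPv4 network) their host resolves to.

    host_to_ip is a hypothetical, pre-resolved {hostname: ip} mapping;
    no DNS lookups happen here. prefix_len=32 groups by exact address,
    prefix_len=24 ignores the 4th octet.
    """
    groups = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        ip = host_to_ip.get(host)
        if ip is None:
            # Hosts without a known address go into one shared bucket.
            groups["unresolved"].append(url)
            continue
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(network.network_address)].append(url)
    return dict(groups)
```

With one queue per group, a flood of subdomains served from the same address ends up sharing a single politeness budget instead of each getting its own.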
---
Joe Naegele
Grier Forensics
-Original Message-
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Wednesday, January 11, 2017 9:32 AM
To: user@nutch.apache.org
Subject: Re: General question
Hello Joseph,
The only feasible method, as I see it, is being able to detect these kinds of spam
sites, as well as domain-parking sites, which produce lots of garbage too. Once
you detect them, you can choose not to follow their outlinks, or mark them in a
domain-blacklist urlfilter.
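
As a rough illustration of the detect-then-blacklist idea, here is a minimal Python sketch. It is not the detection model discussed in this thread and not a Nutch plugin. It assumes one simple, hypothetical heuristic: a domain with an unusually large number of distinct subdomains is suspect. The names `registered_domain`, `build_domain_blacklist`, `allow_url` and the `max_subdomains` threshold are made up for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def registered_domain(host):
    # Naive assumption: the registered domain is the last two labels.
    # This is wrong for suffixes like "co.uk"; a real filter would
    # consult the public suffix list instead.
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def build_domain_blacklist(urls, max_subdomains=100):
    """Return domains that have more than max_subdomains distinct hosts."""
    hosts_per_domain = defaultdict(set)
    for url in urls:
        host = urlsplit(url).hostname
        if not host:
            continue
        hosts_per_domain[registered_domain(host)].add(host)
    return {
        domain
        for domain, hosts in hosts_per_domain.items()
        if len(hosts) > max_subdomains
    }


def allow_url(url, blacklist):
    """Reject URLs without a host or whose domain is blacklisted."""
    host = urlsplit(url).hostname
    return bool(host) and registered_domain(host) not in blacklist
```

The threshold is something you would have to tune against your own crawl data, since large legitimate sites can also have many subdomains.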
We have seen thes
Hi Joe,
Do these subdomains point to the same IP address? Did they blacklist your
server, i.e. can you connect to these domains from the crawl server using a
different tool like curl?
It is not a silver bullet, but one way of preventing this is to group by IP or
domain (fetcher.queue.mode and partition.url