Hello Apache Supporters and Enthusiasts
This is your FINAL reminder that the Call for Papers (CFP) for the
Apache EU Roadshow is closing soon. Our Apache EU Roadshow will focus on
Cloud, IoT, Apache Tomcat, Apache Http and will run from 13-14 June 2018
in Berlin.
Note that the CFP deadline has
> 1) Do we have a config setting that we can use already?
Not out-of-the-box. But there is already an extension point for your use case
[1]:
the filter method takes two arguments (fromURL and toURL).
Have a look at it, maybe you can fix it by implementing/contributing a plugin.
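As a rough illustration of what such a two-argument filter could do, here is a minimal standalone sketch. It does not implement the actual Nutch extension point: the class name SameSiteFilterSketch, the helper normalizeHost, and the convention that returning true means "treat the link as internal" are all assumptions made for this example. It only shows the idea of comparing the hosts of fromUrl and toUrl after dropping a leading "www." label.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical sketch, not the real Nutch plugin interface.
public class SameSiteFilterSketch {

    // Assumed helper: lower-case the host and drop one leading "www." label.
    static String normalizeHost(String url) throws URISyntaxException {
        String host = new URI(url).getHost();
        if (host == null) {
            return null; // no host component, e.g. a relative or opaque URI
        }
        host = host.toLowerCase(Locale.ROOT);
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // Mirrors the two-argument shape (fromUrl, toUrl) mentioned above.
    // Assumption: true means the link should be treated as internal.
    static boolean filter(String fromUrl, String toUrl) {
        try {
            String from = normalizeHost(fromUrl);
            String to = normalizeHost(toUrl);
            return from != null && from.equals(to);
        } catch (URISyntaxException e) {
            return false; // unparsable URLs are never treated as internal
        }
    }

    public static void main(String[] args) {
        // www and non-www variants of the same host
        System.out.println(filter("http://www.somewebsite.com/a", "http://somewebsite.com/b"));
        // a different subdomain of the same domain
        System.out.println(filter("http://www.somewebsite.com/a", "http://www.art.somewebsite.com/"));
    }
}
```

Because the comparison is on the normalized host and not on the registered domain, a link to another subdomain of the same domain is still treated as external in this sketch.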
> 2) ... It looks
Hi Sebastian,
If I
- modify the method URLUtil.getDomainName(URL url)
doesn't it mean that I don't need
- set db.ignore.external.links.mode=byDomain
anymore? http://www.somewebsite.com becomes the same host as somewebsite.com.
To make it as generic as possible I can create an issue/pull req
Hi Semyon,
> interpret www.somewebsite.com and somewebsite.com as one host?
Yes, that's a common problem. It mostly shows up with external links, which
must include the host name. Well-designed sites use relative links
for internal same-host links.
For a quick work-around:
- set db.ignore.externa
Thanks Yossi, Markus,
I have an issue with the db.ignore.external.links.mode=byDomain solution.
I crawl specific hosts only, so I have a finite number of hosts to crawl.
Let's say, www.somewebsite.com
I want to stay limited to this host. In other words, neither
www.art.somewebsite.com no