FINAL REMINDER: CFP for Apache EU Roadshow Closes 25th February

2018-02-21 Thread Sharan F
Hello Apache Supporters and Enthusiasts, this is your FINAL reminder that the Call for Papers (CFP) for the Apache EU Roadshow is closing soon. Our Apache EU Roadshow will focus on Cloud, IoT, Apache Tomcat, and Apache Http, and will run on 13-14 June 2018 in Berlin. Note that the CFP deadline has

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-21 Thread Sebastian Nagel
> 1) Do we have a config setting that we can use already? Not out-of-the-box. But there is already an extension point for your use case [1]: the filter method takes two arguments (fromURL and toURL). Have a look at it; maybe you can fix it by implementing/contributing a plugin. > 2) ... It looks

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-21 Thread Semyon Semyonov
Hi Sebastian, if I modify the method URLUtil.getDomainName(URL url), doesn't that mean I no longer need to set db.ignore.external.links.mode=byDomain? http://www.somewebsite.com becomes the same host as somewebsite.com. To make it as generic as possible I can create an issue/pull req
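The idea of treating www.somewebsite.com and somewebsite.com as one host can be illustrated with a small standalone sketch. This is not Nutch code and not the actual body of URLUtil.getDomainName; the class HostNormalizer and its methods are hypothetical names, and stripping a single leading "www." label is an assumption about what "same host" should mean here.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: compares hosts after dropping a leading "www.".
final class HostNormalizer {
    private HostNormalizer() {}

    /** Returns the lower-cased host with one leading "www." removed, or null if no host can be parsed. */
    static String normalizeHost(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null;
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** True when both URLs resolve to the same normalized host. */
    static boolean sameHost(String fromUrl, String toUrl) {
        String from = normalizeHost(fromUrl);
        return from != null && from.equals(normalizeHost(toUrl));
    }
}
```

With this rule, a link from http://www.somewebsite.com/ to http://somewebsite.com/page counts as internal, while a link to a different subdomain does not.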

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-21 Thread Sebastian Nagel
Hi Semyon, > interpret www.somewebsite.com and somewebsite.com as one host? Yes, that's a common problem, mostly because of external links, which must include the host name; well-designed sites would use relative links for internal same-host links. For a quick work-around: - set db.ignore.externa

Re: RE: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-21 Thread Semyon Semyonov
Thanks Yossi, Markus, I have an issue with the db.ignore.external.links.mode=byDomain solution. I crawl specific hosts only, so I have a finite number of hosts to crawl. Let's say www.somewebsite.com: I want to stay limited to this host. In other words, neither www.art.somewebsite.com no
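The constraint described here, a finite list of hosts where sibling subdomains must stay out of scope, might be sketched as a fixed allow-list. The class HostScope is a hypothetical illustration rather than a Nutch plugin, and it assumes the only tolerated variation is a leading "www." on an allowed host.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical allow-list of crawl hosts; subdomains are rejected unless listed explicitly.
final class HostScope {
    private final Set<String> allowedHosts = new HashSet<>();

    HostScope(String... seedHosts) {
        for (String host : seedHosts) {
            allowedHosts.add(canonical(host));
        }
    }

    // Lower-cases the host and drops one leading "www." (assumed equivalence rule).
    private static String canonical(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }

    /** True only when the URL's host matches an allowed host exactly, ignoring a leading "www.". */
    boolean accepts(String url) {
        try {
            String host = new URI(url).getHost();
            return host != null && allowedHosts.contains(canonical(host));
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```

A scope built from "www.somewebsite.com" accepts http://somewebsite.com/x but rejects http://www.art.somewebsite.com/, which a domain-level match would have let through.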