Hi Karl, thanks for the suggestion. I tried it, but the crawled website sends 301 redirects to the canonical hostname when pages are requested directly via IP address, which leads back to the IP lookup again. I guess I'm stuck with the /etc/hosts solution then; this will get messy if the IP changes often.
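For reference, the static mapping in /etc/hosts would look roughly like the line below (the IP address and hostname are placeholders, and the entry has to be updated by hand whenever the address changes):

    203.0.113.10    www.example.com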
I'm interested to understand the mechanics of the crawler better: what is the reason for resolving the IP addresses instead of using the hostname?

Thanks,
Markus

Sent: Monday, 10 October 2016, 22:00
From: "Karl Wright" <daddy...@gmail.com>
To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: webcrawler connector and dns lookups behind corporate http proxy

If the proxy is not authenticated, I think you can just put the IP address in as the machine name and it should work. But that's all I can think of.

Karl

On Mon, Oct 10, 2016 at 3:44 PM, Markus Schuch <markus_sch...@web.de> wrote:

Hi @ the lovely mcf community out there,

in our setup we run manifoldcf (2.3) behind a corporate http proxy server and we try to crawl specific web pages on the internet. We run into java.net.UnknownHostException because the connector tries to resolve the IP of the hostname. This fails because our network setup does not allow direct DNS lookups for internet pages, and the JDK's InetAddress.getByName() call relies on the system's DNS lookup mechanisms. All internet traffic goes through the corporate HTTP proxy server, which does all necessary DNS resolution on its side.

Can you think of any other (more elegant) solution besides adding the records to /etc/hosts on the crawler's machine?

Many thanks in advance,
Markus
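To illustrate the behaviour described above (this is not the connector's actual code, just a minimal JDK-level sketch with placeholder names "www.example.com" and "proxy.corp.example"): InetAddress.getByName() always asks the local resolver and fails when direct DNS is blocked, whereas a request routed through java.net.Proxy passes the hostname to the proxy unresolved, so the proxy performs the lookup on its side.

    import java.net.HttpURLConnection;
    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.net.Proxy;
    import java.net.URL;

    public class ProxyDnsDemo {
        public static void main(String[] args) throws Exception {
            // Local resolution: uses the OS resolver and throws
            // UnknownHostException when direct DNS lookups are blocked.
            try {
                InetAddress addr = InetAddress.getByName("www.example.com");
                System.out.println("Resolved locally: " + addr.getHostAddress());
            } catch (java.net.UnknownHostException e) {
                System.out.println("Local DNS lookup failed: " + e.getMessage());
            }

            // Proxy-based request: the target hostname is sent to the proxy
            // inside the HTTP request line, so the proxy does the resolution.
            Proxy proxy = new Proxy(Proxy.Type.HTTP,
                    new InetSocketAddress("proxy.corp.example", 8080));
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://www.example.com/").openConnection(proxy);
            System.out.println("HTTP status via proxy: " + conn.getResponseCode());
        }
    }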