Hi Karl,
 
Thanks for the suggestion. I tried it, but the crawled website sends 301
redirects to the canonical hostname when pages are requested directly via IP
address - which leads again to the IP lookup.
I guess I'm stuck with the /etc/hosts solution then. This will get messy if
the IP changes often.
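For reference, the /etc/hosts workaround looks like this (the IP address and
hostname below are made-up placeholders, not the real site):

    203.0.113.10    www.example.com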

I'd like to better understand the mechanics of the crawler: what is the
reason for resolving the IP addresses instead of using the hostnames?
 
Thanks
Markus
 

Sent: Monday, 10 October 2016 at 22:00
From: "Karl Wright" <daddy...@gmail.com>
To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: webcrawler connector and dns lookups behind corporate http proxy

If the proxy is not authenticated, I think you can just put the IP address in 
as the machine name and it should work.  But that's all I can think of.
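For example, if the site is normally seeded as http://www.example.com/ and
currently resolves to 203.0.113.10 (both values made up here), the seed URL
in the job would simply become:

    http://203.0.113.10/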
 
Karl
 
 
On Mon, Oct 10, 2016 at 3:44 PM, Markus Schuch <markus_sch...@web.de> wrote:

Hi @ the lovely mcf community out there,
 
In our setup, we run ManifoldCF (2.3) behind a corporate HTTP proxy server,
and we are trying to crawl specific web pages on the internet.
 
We run into java.net.UnknownHostException because the connector tries to
resolve the IP of the hostname. This fails because our network setup does not
allow direct DNS lookups for internet hosts, and the JDK's
InetAddress.getByName() call relies on the system's DNS lookup mechanisms.
All internet traffic goes through the corporate HTTP proxy server, which does
all necessary DNS resolution on its side.
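To illustrate the difference, here is a minimal standalone sketch (the proxy
host and port are placeholders for our environment): the plain InetAddress
lookup fails locally, while a request routed through the HTTP proxy needs no
local lookup, because the full URL is handed to the proxy:

    import java.net.HttpURLConnection;
    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.net.Proxy;
    import java.net.URL;
    import java.net.UnknownHostException;

    public class ProxyDnsDemo {
        public static void main(String[] args) throws Exception {
            // Plain lookup: uses the system resolver and fails on a host
            // that has no DNS access for internet names.
            try {
                InetAddress.getByName("example.com");
                System.out.println("local DNS lookup succeeded");
            } catch (UnknownHostException e) {
                System.out.println("local DNS lookup failed: " + e);
            }

            // Via the HTTP proxy: the request line "GET http://example.com/ ..."
            // is sent to the proxy, which resolves the name on its side.
            // "proxy.corp.example" and 8080 are placeholder values.
            Proxy proxy = new Proxy(Proxy.Type.HTTP,
                    new InetSocketAddress("proxy.corp.example", 8080));
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://example.com/").openConnection(proxy);
            System.out.println("via proxy: HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }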
 
Can you think of any other (more elegant) solution besides adding the records
to /etc/hosts on the crawler's machine?
 
Many thanks in advance,
Markus
 
 
