On Sep 19, 2011, at 1:52pm, Markus Jelsma wrote:

> Hi,
> 
> I sometimes come across relative outlinks in the source that are intended as 
> absolute but where the webmaster or CMS omits the protocol scheme. This 
> results in repeating URI segments and crap URLs. 
> 
> Would an option that treats such URLs as absolute be a good idea? This 
> problem is similar to the other thread with relative URLs without a base. 
> 
> The issue right now is that Tika already does the URL resolving as part of 
> the parsing, so we have no control.

I've seen the same thing during crawls.

But how would you tell whether super.uk/page should resolve to 
http://super.uk/page or to http://domain.com/super.uk/page?

Or would you only do this special handling when the first piece of the relative 
URL matches the base URL's domain?
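
To make that second option concrete, here is a minimal sketch of the 
heuristic. It is written in Python for brevity and is not Nutch or Tika code; 
the function name resolve_outlink is hypothetical, and it assumes the 
scheme-less link should inherit the base URL's scheme. It also requires an 
exact host match, so www.domain.com and domain.com are treated as different.

```python
from urllib.parse import urljoin, urlsplit


def resolve_outlink(base_url: str, link: str) -> str:
    """Resolve link against base_url (hypothetical helper, not a Nutch/Tika API).

    A scheme-less link is treated as absolute only when its first path
    segment equals the host of the base URL. Everything else falls back
    to ordinary relative resolution.
    """
    base = urlsplit(base_url)

    # Links that already carry a scheme, or that are root-relative,
    # protocol-relative, query-only or fragment-only, resolve normally.
    if urlsplit(link).scheme or link.startswith(("/", "?", "#")):
        return urljoin(base_url, link)

    first_segment = link.split("/", 1)[0].lower()
    base_host = (base.hostname or "").lower()

    # Assumption: the webmaster dropped the scheme, so reuse the base's.
    if base_host and first_segment == base_host:
        return f"{base.scheme}://{link}"

    return urljoin(base_url, link)


base = "http://domain.com/dir/index.html"
print(resolve_outlink(base, "domain.com/page"))  # http://domain.com/page
print(resolve_outlink(base, "super.uk/page"))    # http://domain.com/dir/super.uk/page
```

With this rule the ambiguous super.uk/page case stays a normal relative link, 
so only links that repeat the page's own host get rewritten.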

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr


