> On Sep 19, 2011, at 1:52pm, Markus Jelsma wrote:
> > Hi,
> > 
> > I sometimes come across relative outlinks in the source that are intended
> > as absolute but where the webmaster or CMS omits the protocol scheme.
> > This results in repeated URI segments and broken URLs.
> > 
> > Would an option that treats such URLs as absolute be a good idea? This
> > problem is similar to the other thread about relative URLs without a
> > base.
> > 
> > The issue right now is that Tika already does the URL resolving as part
> > of the parsing so we have no control.
> 
> I've seen the same thing during crawls.

Glad to hear we're not alone in this mess ;)

> 
> But how would you know the difference between super.uk/page =>
> http://super.uk/page and http://domain.com/super.uk/page?
> 
> Or would you only do this special handling when the first piece of the
> relative URL matches the base URL's domain?

Yes! That is a very common pattern with this issue. It doesn't yet handle 
dotted segments but those are less common.

Doing this as part of URL resolving is more CPU friendly than a regular 
expression that looks for the FQDN in the first or any URI segment.
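
To make the idea concrete, here is a minimal sketch of the check, assuming it
runs at the point where an outlink is resolved against its base. The class and
method names are hypothetical and not part of Nutch or Tika, and it uses plain
java.net.URI rather than whatever the parser does internally. It only covers
the case where the first segment of the relative URL equals the base URL's
host:

```java
import java.net.URI;

// Hypothetical helper for illustration only; not a Nutch or Tika API.
final class SchemelessOutlinks {

    private SchemelessOutlinks() {}

    /**
     * Resolves target against base. If the first path segment of a
     * relative target equals the host of the base URL, the target is
     * treated as an absolute URL whose scheme was omitted, and the
     * scheme of the base URL is used instead.
     */
    static URI resolve(URI base, String target) {
        String trimmed = target.trim();

        // First segment: everything before the first slash, or the whole
        // string if there is no slash at all.
        int slash = trimmed.indexOf('/');
        String first = slash < 0 ? trimmed : trimmed.substring(0, slash);

        // An empty first segment means the target starts with "/" or "//",
        // which normal resolution already handles correctly.
        if (!first.isEmpty() && first.equalsIgnoreCase(base.getHost())) {
            return URI.create(base.getScheme() + "://" + trimmed);
        }

        // Everything else goes through standard relative resolution.
        return base.resolve(trimmed);
    }
}
```

With a base of http://super.uk/dir/index.html, the target super.uk/page comes
out as http://super.uk/page, while other/page is still resolved normally. A
page that really has a directory named after its own host would be resolved
wrongly, so this would have to be an option and not the default.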

Thanks!

> 
> -- Ken
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
