> On Sep 19, 2011, at 1:52pm, Markus Jelsma wrote:
> > Hi,
> >
> > I sometimes come across relative outlinks in the source that are intended
> > as absolute but where the webmaster or CMS omits the protocol scheme.
> > This results in repeating URI segments and crap URL's.
> >
> > Would an option that treat such URL's as absolute be a good idea? This
> > problem is similar to the other thread with relative URL's without a
> > base.
> >
> > The issue right now is that Tika already does the URL resolving as part
> > of the parsing so we have no control.
>
> I've seen the same thing during crawls.

Glad to hear we're not alone in this mess ;)

>
> But how would you know the difference between super.uk/page =>
> http://super.uk/page and http://domain.com/super.uk/page?
>
> Or would you only do this special handling when the first piece of the
> relative URL matches the base URL's domain?

Yes! That is a very common pattern with this issue. It doesn't yet handle
dotted segments but those are less common. Doing this as part of URL
resolving is more CPU friendly than a regular expression that looks for the
FQDN in the first or any URI segment.

Thanks!

>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
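
For illustration, here is a rough sketch of the heuristic discussed above: a relative outlink is treated as absolute only when its first segment matches the base URL's host. This is not the actual Nutch or Tika code. The class name SchemelessOutlinkResolver and its methods are hypothetical, it uses plain java.net.URI, and like the approach described in the thread it does not handle dotted segments such as "./" or "../".

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch or Tika.
final class SchemelessOutlinkResolver {

    private SchemelessOutlinkResolver() {}

    /**
     * Resolves target against base. If target is a relative reference whose
     * first segment equals the base host (case-insensitive), it is assumed to
     * be an absolute URL with the scheme omitted, and the base scheme is used.
     */
    static URI resolve(URI base, String target) {
        String trimmed = target.trim();
        // Leave scheme-qualified, protocol-relative and root-relative links alone.
        if (!trimmed.contains("://") && !trimmed.startsWith("/")) {
            String firstSegment = trimmed.substring(0, firstSegmentEnd(trimmed));
            if (base.getHost() != null && firstSegment.equalsIgnoreCase(base.getHost())) {
                return URI.create(base.getScheme() + "://" + trimmed);
            }
        }
        // Everything else gets ordinary relative resolution.
        return base.resolve(trimmed);
    }

    /** Index of the first '/', '?' or '#', or the string length if none occurs. */
    private static int firstSegmentEnd(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                return i;
            }
        }
        return s.length();
    }
}
```

With a base of http://super.uk/dir/index.html, the outlink super.uk/page matches the base host and is treated as absolute, while an outlink such as other.uk/page does not match and is resolved as an ordinary relative path. A plain string comparison on the first segment avoids running a regular expression over every URI segment, which is the CPU argument made above.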

