On Sep 19, 2011, at 1:52pm, Markus Jelsma wrote:

> Hi,
>
> I sometimes come across relative outlinks in the source that are intended as
> absolute but where the webmaster or CMS omits the protocol scheme. This
> results in repeating URI segments and crap URLs.
>
> Would an option that treats such URLs as absolute be a good idea? This problem
> is similar to the other thread with relative URLs without a base.
>
> The issue right now is that Tika already does the URL resolving as part of the
> parsing so we have no control.

I've seen the same thing during crawls.

But how would you know the difference between super.uk/page => http://super.uk/page and http://domain.com/super.uk/page?

Or would you only do this special handling when the first piece of the relative URL matches the base URL's domain?

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
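
For illustration only, here is a minimal sketch of the heuristic raised above: treat a scheme-less link as absolute only when its first segment matches the base URL's host, and fall back to ordinary relative resolution otherwise. It is written in Python with the standard library rather than against Nutch or Tika, and the function name `resolve_outlink` is hypothetical. It assumes an exact, case-insensitive host match, so variants such as `www.super.uk` versus `super.uk`, or a link carrying a port, are not handled.

```python
from urllib.parse import urljoin, urlsplit


def resolve_outlink(base_url: str, link: str) -> str:
    """Resolve `link` against `base_url`.

    Hypothetical helper: a scheme-less link is treated as absolute only
    when its first segment equals the host of the base URL. Everything
    else goes through normal relative resolution.
    """
    base = urlsplit(base_url)

    # Links that already carry a scheme, or that are clearly relative
    # (root-relative, protocol-relative, dot paths, query or fragment
    # only), are resolved the standard way.
    if urlsplit(link).scheme or link.startswith(("/", ".", "?", "#")):
        return urljoin(base_url, link)

    first_segment = link.split("/", 1)[0].lower()
    host = (base.hostname or "").lower()

    # Assumption: an exact host match signals an omitted scheme.
    if first_segment and first_segment == host:
        return f"{base.scheme}://{link}"

    # Otherwise keep the link relative, which is the ambiguous case
    # from the question above.
    return urljoin(base_url, link)
```

With this rule, `super.uk/page` found on a page at `http://super.uk/dir/index.html` becomes `http://super.uk/page`, while the same link found on `http://domain.com/` stays relative and resolves to `http://domain.com/super.uk/page`. The second case may still be a mistake by the webmaster, but it cannot be told apart from a legitimate relative path without more information.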

