[ https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085161#comment-13085161 ]
David Broadfoot commented on CONNECTORS-157:
--------------------------------------------

Hi Karl - I've recently run into a very similar problem.

I have a job with this page as a seed:

http://mysare.sare.org/MySareF?do=searchProj&q=*&searchmethod=and&region=&state=&projType=0&sortby=1&page=1

and I want to crawl / ingest pages with do=viewRept in the URLs. The links in the HTML are like this:

<a href="?do=viewRept&pn=LNE03-182&y=2004&t=0">2004 Annual Report

When I check the simple history, I see lines like the following in the identifier column:

http://mysare.sare.org/MySare/?do=viewRept&pn=LNC04-240&t=0&y=2006

i.e. the /ProjectReport.aspx part is being omitted from the crawled URL.

Any idea what is going on here?

> Root-relative paths without leading / do not resolve properly
> -------------------------------------------------------------
>
>                 Key: CONNECTORS-157
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-157
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>   Affects Versions: ManifoldCF 0.1
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 0.2
>
>
> If a document has a URL which is just the domain, e.g. "http://foo.com", the
> java.net.URI class fails to resolve URLs in that document which have no
> starting "/", e.g. "document.pdf". The resolved URI has no path part, e.g.
> "http://foo.comdocument.pdf". This is apparently a bug, but we need to find
> a way to work around it properly.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
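
For illustration, here is a minimal sketch of one possible workaround for the two cases discussed above. It is not the ManifoldCF fix. The class name UriResolveSketch, the helper resolve(), and the foo.com page URLs are hypothetical, and the sketch assumes a hierarchical base URL with an authority (http or https). It gives a base with an empty path an explicit "/" before resolving, and it treats a query-only reference such as "?do=viewRept" as keeping the base path, which is what RFC 3986 section 5.2.2 specifies. java.net.URI is documented as following the older RFC 2396, whose resolution algorithm drops the last path segment for a query-only reference, so that may be what the history lines above are showing.

```java
import java.net.URI;

final class UriResolveSketch {

    // Hypothetical helper for illustration only; not ManifoldCF code.
    // Assumes 'base' is hierarchical and has an authority (e.g. http/https).
    static URI resolve(URI base, String reference) {
        String path = base.getRawPath();
        // Case 1: a base such as "http://foo.com" has an empty path; use "/".
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String prefix = base.getScheme() + "://" + base.getRawAuthority() + path;

        // Case 2: a query-only reference keeps the base path
        // (RFC 3986, section 5.2.2) and only replaces the query.
        if (reference.startsWith("?")) {
            return URI.create(prefix + reference);
        }

        // Everything else: delegate to java.net.URI, but resolve against
        // the base whose path is now guaranteed to be non-empty.
        String query = base.getRawQuery();
        URI normalizedBase = URI.create(query == null ? prefix : prefix + "?" + query);
        return normalizedBase.resolve(reference);
    }

    public static void main(String[] args) {
        URI bare = URI.create("http://foo.com");
        // Plain java.net.URI resolution, printed for comparison only.
        System.out.println(bare.resolve("document.pdf"));
        // Helper result: http://foo.com/document.pdf
        System.out.println(resolve(bare, "document.pdf"));

        URI page = URI.create("http://foo.com/dir/page.aspx?page=1");
        // Plain java.net.URI resolution, printed for comparison only.
        System.out.println(page.resolve("?do=viewRept"));
        // Helper result: http://foo.com/dir/page.aspx?do=viewRept
        System.out.println(resolve(page, "?do=viewRept"));
    }
}
```

The helper builds the query-only case by string concatenation rather than with the multi-argument URI constructors, because those constructors always quote "%" and would double-encode an already escaped path.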