[jira] [Commented] (CONNECTORS-157) Root-relative paths without leading / do not resolve properly
[ https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085949#comment-13085949 ]

David Broadfoot commented on CONNECTORS-157:

My bad - I mis-pasted the seed URL: http://mysare.sare.org/MySare/ProjectReport.aspx?do=searchProj&q=*&searchmethod=and&region=&state=&projType=0&sortby=1&page=1

That should make more sense :)

Root-relative paths without leading / do not resolve properly
Key: CONNECTORS-157
URL: https://issues.apache.org/jira/browse/CONNECTORS-157
Project: ManifoldCF
Issue Type: Bug
Components: Web connector
Affects Versions: ManifoldCF 0.1
Reporter: Karl Wright
Assignee: Karl Wright
Fix For: ManifoldCF 0.2

If a document has a URL which is just the domain, e.g. "http://foo.com", the java.net.URI class fails to resolve URLs in that document which have no starting /, e.g. document.pdf. The resolved URI has no path part, e.g. "http://foo.comdocument.pdf". This is apparently a bug, but we need to find a way to work around it properly.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CONNECTORS-157) Root-relative paths without leading / do not resolve properly
[ https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085959#comment-13085959 ]

Karl Wright commented on CONNECTORS-157:

I can see what the problem is right away. The relative urls are not legit according to w3c. It looks like I will need to replace java's implementation of url with my own in order to make such a thing work. I will experiment and get back to you.
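For readers following the thread, here is a minimal sketch of the kind of special case being discussed. It is not the code that went into ManifoldCF. It assumes the goal is to match browsers (and the RFC 3986 resolution algorithm), which keep the base path when a reference consists only of a query, whereas the identifiers reported in this thread show the last path segment being dropped. The class name `QueryOnlyResolver` and its method are hypothetical.

```java
import java.net.URI;

// Hypothetical helper, for illustration only.
final class QueryOnlyResolver {

    // Resolves a relative reference against a hierarchical base URI.
    // A reference that starts with '?' keeps the base scheme, authority and
    // path and only replaces the query; everything else is delegated to
    // java.net.URI.resolve.
    static URI resolve(URI base, String relative) {
        if (relative.startsWith("?")) {
            String s = base.toString();
            // Cut the base at its own query or fragment, whichever comes first.
            int end = s.length();
            int q = s.indexOf('?');
            if (q >= 0) {
                end = q;
            }
            int h = s.indexOf('#');
            if (h >= 0 && h < end) {
                end = h;
            }
            return URI.create(s.substring(0, end) + relative);
        }
        return base.resolve(relative);
    }

    public static void main(String[] args) {
        URI page = URI.create(
            "http://mysare.sare.org/MySare/ProjectReport.aspx?do=searchProj&page=1");
        // prints http://mysare.sare.org/MySare/ProjectReport.aspx?do=viewRept&pn=LNE03-182&y=2004&t=0
        System.out.println(resolve(page, "?do=viewRept&pn=LNE03-182&y=2004&t=0"));
    }
}
```

The sketch works on the string form of the base URI so that already-escaped characters are not re-encoded. It assumes the base is a hierarchical URI such as an http URL.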
[jira] [Commented] (CONNECTORS-157) Root-relative paths without leading / do not resolve properly
[ https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085161#comment-13085161 ]

David Broadfoot commented on CONNECTORS-157:

Hi Karl - I've recently run into a very similar problem. I have a job with this page as a seed: http://mysare.sare.org/MySareF?do=searchProj&q=*&searchmethod=and&region=&state=&projType=0&sortby=1&page=1 and I want to crawl / ingest pages with do=viewRept in the URLs.

The links in the HTML are like this: <a href="?do=viewRept&amp;pn=LNE03-182&amp;y=2004&amp;t=0">2004 Annual Report</a>

When I check the simple history, I see lines like the following in the identifier column: http://mysare.sare.org/MySare/?do=viewRept&pn=LNC04-240&t=0&y=2006 i.e. the /ProjectReport.aspx part is being omitted from the crawled URL. Any idea what is going on here?
[jira] [Commented] (CONNECTORS-157) Root-relative paths without leading / do not resolve properly
[ https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085171#comment-13085171 ]

Karl Wright commented on CONNECTORS-157:

What is the URL of the page with the link? Is it "http://mysare.sare.org/MySareF?do=searchProj&q=*&searchmethod=and&region=&state=&projType=0&sortby=1&page=1"? Because, if so, I don't see where /ProjectReport.aspx is supposed to be coming from.

The relative URL composition rules in Java in general adhere to the w3c specifications. The problem is often that browsers do somewhat different things than the w3c spec. So let's work with your specific case and figure out what's happening. The only two inputs are: (a) the URL of the page, and (b) the relative URL of the reference. Since I don't see /ProjectReport.aspx in either one, it must either be in the page URL, or there must be a redirection taking place.
[jira] Commented: (CONNECTORS-157) Root-relative paths without leading / do not resolve properly
[ https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990061#comment-12990061 ]

Karl Wright commented on CONNECTORS-157:

After consultation, updated the fix to more narrowly target the problem case. r1066781.

Root-relative paths without leading / do not resolve properly
Key: CONNECTORS-157
URL: https://issues.apache.org/jira/browse/CONNECTORS-157
Project: ManifoldCF
Issue Type: Bug
Components: Web connector
Affects Versions: ManifoldCF 0.1
Reporter: Karl Wright
Assignee: Karl Wright
Fix For: ManifoldCF next
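One way to target only the problem case from the issue description (a base URL that is just the domain, with no path) can be sketched as follows. This is an illustration under stated assumptions, not the change committed in r1066781: it assumes that giving such a base URI the path "/" before calling java.net.URI.resolve is an acceptable workaround. The class name `EmptyPathResolver` is hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, for illustration only.
final class EmptyPathResolver {

    // Resolves a relative reference against a base URI. When the base has an
    // authority but an empty path (e.g. http://foo.com), the base is rebuilt
    // with the path "/" first, so the resolved URI keeps the separator between
    // host and path. All other bases are passed to URI.resolve unchanged.
    static URI resolve(URI base, String relative) throws URISyntaxException {
        String path = base.getRawPath();
        boolean emptyPath = (path == null || path.isEmpty());
        if (base.getRawAuthority() != null && emptyPath) {
            base = new URI(base.getScheme(), base.getAuthority(), "/",
                           base.getQuery(), base.getFragment());
        }
        return base.resolve(relative);
    }

    public static void main(String[] args) throws URISyntaxException {
        // prints http://foo.com/document.pdf
        System.out.println(resolve(new URI("http://foo.com"), "document.pdf"));
    }
}
```

Bases that already have a path take the unmodified `URI.resolve` route, so the special case stays limited to the domain-only form described in the issue.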