[jira] [Commented] (CONNECTORS-157) Root-relative paths without leading / do not resolve properly

2011-08-16 Thread David Broadfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085949#comment-13085949
 ] 

David Broadfoot commented on CONNECTORS-157:


My bad - I mis-pasted the seed URL:

http://mysare.sare.org/MySare/ProjectReport.aspx?do=searchProj&q=*&searchmethod=and&region=&state=&projType=0&sortby=1&page=1

That should make more sense :)



 Root-relative paths without leading / do not resolve properly
 -

 Key: CONNECTORS-157
 URL: https://issues.apache.org/jira/browse/CONNECTORS-157
 Project: ManifoldCF
  Issue Type: Bug
  Components: Web connector
Affects Versions: ManifoldCF 0.1
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 0.2


 If a document has a URL which is just the domain, e.g. "http://foo.com", the 
 java.net.URI class fails to resolve URLs in that document which have no 
 starting /, e.g. document.pdf.  The resolved URI has no path part, e.g. 
 "http://foo.comdocument.pdf".  This is apparently a bug, but we need to find 
 a way to work around it properly.
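
 One possible workaround, sketched here only as an illustration and not taken
 from the actual ManifoldCF fix, is to give an empty-path base URI the root
 path "/" before resolving against it. The class and method names
 (EmptyPathBase, withRootPath, resolve) are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: not part of ManifoldCF.
final class EmptyPathBase {
    /** Returns the base unchanged unless it has an authority and an empty path; then the path becomes "/". */
    static URI withRootPath(URI base) {
        String path = base.getRawPath();
        boolean emptyPath = path == null || path.isEmpty();
        if (base.isOpaque() || base.getRawAuthority() == null || !emptyPath) {
            return base; // nothing to normalize
        }
        try {
            // Rebuild the URI with "/" as its path, keeping the other components.
            return new URI(base.getScheme(), base.getAuthority(), "/",
                           base.getQuery(), base.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Resolves a reference found in a document against that document's URL. */
    static URI resolve(String documentUrl, String reference) {
        return withRootPath(URI.create(documentUrl)).resolve(reference);
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://foo.com", "document.pdf"));
    }
}
```

 With the base normalized to "http://foo.com/", the reference document.pdf
 resolves to "http://foo.com/document.pdf". Bases that already have a path
 are passed through untouched.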

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CONNECTORS-157) Root-relative paths without leading / do not resolve properly

2011-08-16 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085959#comment-13085959
 ] 

Karl Wright commented on CONNECTORS-157:


I can see what the problem is right away.  The relative URLs are not legit 
according to the W3C.  It looks like I will need to replace Java's 
implementation of URL resolution with my own in order to make such a thing 
work.  I will experiment and get back to you.
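
The links in question are query-only references such as "?do=viewRept...".
java.net.URI documents itself as following RFC 2396, under which such a
reference is merged with the base path minus its last segment; that would
account for the dropped /ProjectReport.aspx. RFC 3986 (section 5.2.2) instead
keeps the base path and replaces only the query, which is what browsers do.
A minimal sketch of that special case follows. It is an assumption about what
a replacement could look like, not the change that was actually committed,
and the class name QueryOnlyReference is hypothetical. It also assumes HTML
entities such as &amp; have already been decoded.

```java
import java.net.URI;

// Hypothetical helper: not part of ManifoldCF.
final class QueryOnlyReference {
    /** Resolves a reference, keeping the base path when the reference is query-only. */
    static URI resolve(URI base, String reference) {
        if (!reference.startsWith("?")) {
            return base.resolve(reference); // default java.net.URI behavior
        }
        // RFC 3986 5.2.2: empty reference path, so keep scheme, authority and
        // path of the base and take the query (and fragment) from the reference.
        String text = base.toString();
        int cut = text.length();
        int query = text.indexOf('?');
        int fragment = text.indexOf('#');
        if (query >= 0) cut = Math.min(cut, query);
        if (fragment >= 0) cut = Math.min(cut, fragment);
        return URI.create(text.substring(0, cut) + reference);
    }

    public static void main(String[] args) {
        URI page = URI.create(
            "http://mysare.sare.org/MySare/ProjectReport.aspx?do=searchProj&page=1");
        System.out.println(resolve(page, "?do=viewRept&pn=LNE03-182&y=2004&t=0"));
    }
}
```

Only the query-only case is special-cased; every other reference still goes
through URI.resolve, so the change stays narrow.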







[jira] [Commented] (CONNECTORS-157) Root-relative paths without leading / do not resolve properly

2011-08-15 Thread David Broadfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085161#comment-13085161
 ] 

David Broadfoot commented on CONNECTORS-157:


Hi Karl - I've recently run into a very similar problem.

I have a job with this page as a seed:

http://mysare.sare.org/MySareF?do=searchProj&q=*&searchmethod=and&region=&state=&projType=0&sortby=1&page=1

and I want to crawl / ingest pages with 

do=viewRept

in the urls. The links in the html are like this:

<a href="?do=viewRept&amp;pn=LNE03-182&amp;y=2004&amp;t=0">2004 Annual Report</a>

When I check the simple history, I see lines like the following in the 
identifier column:

http://mysare.sare.org/MySare/?do=viewRept&pn=LNC04-240&t=0&y=2006

i.e. the /ProjectReport.aspx part is being omitted from the crawled URL.

Any idea what is going on here?







[jira] [Commented] (CONNECTORS-157) Root-relative paths without leading / do not resolve properly

2011-08-15 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085171#comment-13085171
 ] 

Karl Wright commented on CONNECTORS-157:


What is the URL of the page with the link?  Is it 
"http://mysare.sare.org/MySareF?do=searchProj&q=*&searchmethod=and&region=&state=&projType=0&sortby=1&page=1"?
  Because, if so, I don't see where /ProjectReport.aspx is supposed to be 
coming from.

The relative URL composition rules in Java generally adhere to the W3C 
specifications.  The problem is often that browsers do somewhat different 
things than the spec requires.  So let's work with your specific case and 
figure out what's happening.  The only two inputs are: (a) the URL of the 
page, and (b) the relative URL of the reference.  Since I don't see 
/ProjectReport.aspx in either one as given, it must either be in the actual 
page URL, or there must be a redirection taking place.
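
To make the two-input point concrete, here is a small sketch that resolves
the same reference against two candidate page URLs using java.net.URI. The
reference "view.aspx" is a hypothetical path-relative link chosen for
illustration (it does not come from the site), and the query strings are
shortened.

```java
import java.net.URI;

// Hypothetical demonstration class: not part of ManifoldCF.
final class TwoInputs {
    /** The result depends only on (a) the page URL and (b) the reference. */
    static String resolve(String pageUrl, String reference) {
        return URI.create(pageUrl).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String reference = "view.aspx"; // hypothetical link text
        // Page URL as first reported: last path segment is "MySareF".
        System.out.println(resolve(
            "http://mysare.sare.org/MySareF?do=searchProj", reference));
        // Page URL with /MySare/ProjectReport.aspx as its path.
        System.out.println(resolve(
            "http://mysare.sare.org/MySare/ProjectReport.aspx?do=searchProj", reference));
    }
}
```

The first call yields http://mysare.sare.org/view.aspx and the second 
http://mysare.sare.org/MySare/view.aspx, so knowing the exact page URL 
(after any redirection) is essential before judging the resolved link.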






[jira] Commented: (CONNECTORS-157) Root-relative paths without leading / do not resolve properly

2011-02-03 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990061#comment-12990061
 ] 

Karl Wright commented on CONNECTORS-157:


After consultation, updated the fix to more narrowly target the problem case.  
r1066781.


 Root-relative paths without leading / do not resolve properly
 -

 Key: CONNECTORS-157
 URL: https://issues.apache.org/jira/browse/CONNECTORS-157
 Project: ManifoldCF
  Issue Type: Bug
  Components: Web connector
Affects Versions: ManifoldCF 0.1
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF next


 If a document has a URL which is just the domain, e.g. "http://foo.com", the 
 java.net.URI class fails to resolve URLs in that document which have no 
 starting /, e.g. document.pdf.  The resolved URI has no path part, e.g. 
 "http://foo.comdocument.pdf".  This is apparently a bug, but we need to find 
 a way to work around it properly.
