I think it would be much easier to validate the seeds list with JavaScript instead of parsing URLs with java.net.URL, simply because that is how we do validation elsewhere in the application.

Checking for valid URLs, supported protocols and illegal characters shouldn't be very complicated in JavaScript.

What do you think?

Erlend

On 16.03.12 11.51, Karl Wright wrote:
"Do you agree that a well-formed URL is what java.net.URL will accept
in the constructor's argument? Then www.example.org will fail, but
http://www.example.org (without a trailing slash) will pass."

I might even go a bit further. See the following code in WebcrawlerConnector:
protected String makeDocumentIdentifier(String parentIdentifier,
String rawURL, DocumentURLFilter filter)

Thanks!
Karl



On Fri, Mar 16, 2012 at 5:52 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:
On 15.03.12 19.30, Karl Wright wrote:

A seed can be a specific html file so complaining about a trailing
slash would make that not work.  For example:

http://hello.world.com/startpage.html


I think I was a little bit unclear in my recent email. By a trailing slash,
I was thinking more about the domain name itself, e.g. www.example.org/.

I will create a Jira ticket now, but I will only focus on well-formed
URLs in the seeds list.

Do you agree that a well-formed URL is what java.net.URL will accept in the
constructor's argument? Then www.example.org will fail, but
http://www.example.org (without a trailing slash) will pass.
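
To make the question concrete, here is a minimal sketch of such a check. The class SeedUrlCheck and the method isWellFormed are hypothetical names used only for illustration and are not taken from the connector code. It treats a seed as well-formed exactly when the java.net.URL constructor accepts it. As far as I know the constructor does not reject every illegal character, so this is only a first check, and on recent JDKs the constructor is deprecated (it still works).

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, for illustration only.
final class SeedUrlCheck {

    // Returns true when java.net.URL accepts the seed string as given.
    static boolean isWellFormed(String seed) {
        try {
            new URL(seed.trim());
            return true;
        } catch (MalformedURLException e) {
            // Thrown, for example, when the protocol is missing or unknown.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] seeds = {
            "www.example.org",                       // no protocol
            "http://www.example.org",                // no trailing slash
            "http://www.example.org/",               // trailing slash
            "http://hello.world.com/startpage.html"  // specific html file
        };
        // Print each seed together with the result of the check.
        for (String seed : seeds) {
            System.out.println(seed + " -> " + isWellFormed(seed));
        }
    }
}
```

With this sketch, www.example.org is rejected because it has no protocol, while the other three seeds are accepted, with or without a trailing slash.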


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


