Re: Possible bug in seeds list (web connector)

2012-03-20 Thread Erlend Garåsen


I have created a ticket for this and entered some unfinished JavaScript 
code. My head does not work very well with regular expressions today, so 
I will improve the code tomorrow.


Please write a few lines about what I need to do with the Python browser 
simulator in order to test the JavaScript.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Possible bug in seeds list (web connector)

2012-03-20 Thread Erlend Garåsen


This sounds fine. I have already written the necessary JavaScript 
code, and it works as it should. I didn't create a ticket earlier because 
I needed time to figure out the best solution and to learn more about 
how the connector works by reading the Java code.


I will create a ticket right away and include the JavaScript, but I 
think I will create a patch as well before I commit my work.


Erlend


Re: Possible bug in seeds list (web connector)

2012-03-20 Thread Karl Wright
I think this is a reasonable approach.  You may need to modify the
Python browser simulator, though, to keep the UI tests working.  I can
help you with that when the time comes.

If you create a ticket and include your proposed JavaScript, I can
review it and let you know how challenging I think it will be to
support it in the browser simulator.  Also, since we are trying to get
a release out the door, I think it makes sense to hold off on these
changes until I can make the release branch.  Sound OK?

Thanks!
Karl


Re: Possible bug in seeds list (web connector)

2012-03-20 Thread Erlend Garåsen


I think it will be much easier to validate the seeds list with 
JavaScript than to parse URLs with java.net.URL, simply because 
this is how we do validation elsewhere in the application.


Checking for valid URLs, supported protocols and illegal characters 
shouldn't be very complicated in JavaScript.
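
The three checks named above could look roughly like this. This is only a sketch in Python, not the JavaScript attached to the ticket: `validate_seeds`, the `SUPPORTED_SCHEMES` set, the one-seed-per-line layout and the rule that whitespace is the illegal character are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumption: only these protocols are considered supported.
SUPPORTED_SCHEMES = {"http", "https"}


def validate_seeds(seeds_text):
    """Return a list of (line_number, message) pairs for rejected seeds."""
    errors = []
    for number, line in enumerate(seeds_text.splitlines(), start=1):
        seed = line.strip()
        if not seed:
            continue  # blank lines are ignored
        # Assumption: embedded whitespace is treated as an illegal character.
        if any(ch.isspace() for ch in seed):
            errors.append((number, "contains whitespace"))
            continue
        try:
            parts = urlsplit(seed)
        except ValueError:
            errors.append((number, "not a parsable URL"))
            continue
        if parts.scheme.lower() not in SUPPORTED_SCHEMES:
            errors.append((number, "missing or unsupported protocol"))
        elif not parts.netloc:
            errors.append((number, "missing host name"))
    return errors


seeds = "http://www.uio.no/\nwww.uio.no\nftp://example.org/\nhttp://bad host/"
for number, message in validate_seeds(seeds):
    print(f"line {number}: {message}")
```

Reporting a line number with each message fits the idea that the UI should complain about a bad seed rather than change it.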


What do you think?

Erlend


Re: Possible bug in seeds list (web connector)

2012-03-16 Thread Karl Wright
"Do you agree that a well-formed URL is what java.net.URL will accept
in the constructor's argument? Then www.example.org will fail, but
http://www.example.org (without a trailing slash) will pass."

I might even go a bit further.  See the following code in:
WebcrawlerConnector:  protected String makeDocumentIdentifier(String
parentIdentifier, String rawURL, DocumentURLFilter filter)

Thanks!
Karl



Re: Possible bug in seeds list (web connector)

2012-03-16 Thread Erlend Garåsen

On 15.03.12 19.30, Karl Wright wrote:
> A seed can be a specific html file so complaining about a trailing
> slash would make that not work.  For example:
>
> http://hello.world.com/startpage.html


I think I was a little bit unclear in my recent email. By a trailing 
slash, I was thinking more about the domain name itself, e.g. 
www.example.org/.


I will create a Jira ticket now, but I will focus only on well-formed 
URLs in the seeds list.


Do you agree that a well-formed URL is what java.net.URL will accept in 
the constructor's argument? Then www.example.org will fail, but 
http://www.example.org (without a trailing slash) will pass.
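
As a rough illustration of the distinction I am asking about, here is a sketch in Python. It is not java.net.URL: `looks_well_formed` is a hypothetical helper built on `urllib.parse.urlsplit`, and it only checks that a scheme and a host are present. java.net.URL applies its own rules (for example, it also rejects protocols it has no handler for), so the two will not agree on every input.

```python
from urllib.parse import urlsplit


def looks_well_formed(seed):
    """Return True if the seed has both a scheme and a host part."""
    parts = urlsplit(seed.strip())
    return bool(parts.scheme) and bool(parts.netloc)


print(looks_well_formed("www.example.org"))                        # False
print(looks_well_formed("http://www.example.org"))                 # True
print(looks_well_formed("http://hello.world.com/startpage.html"))  # True
```

Without a leading protocol, the whole string is parsed as a path, so there is no host to match against.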


Erlend


Re: Possible bug in seeds list (web connector)

2012-03-15 Thread Karl Wright
A seed can be a specific html file so complaining about a trailing
slash would make that not work.  For example:

http://hello.world.com/startpage.html

So I think checking for a well-formed URL is the right level of support
in the UI, and that's probably enough.

Karl


Re: Possible bug in seeds list (web connector)

2012-03-15 Thread Erlend Garåsen


But it does not make sense to me that "www.uio.no" is accepted in 
the seeds list when the consequence is that no URLs are fetched, 
even when nothing else is added to the "include in crawl" list.


I agree. The UI should complain instead of silently changing the format 
of the URL. Do you think the UI should return an error message about a 
missing trailing slash, or should it just complain about a missing 
leading protocol?


At least, it should complain about an invalid URL since it seems to 
accept almost anything typed into the text box.


Erlend


Re: Possible bug in seeds list (web connector)

2012-03-15 Thread Karl Wright
But this makes sense, actually.  The url "http://www.uio.no" does not
actually match the regexp "http://www.uio.no/.*", so it is ditched.
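
To make that concrete, here is a minimal sketch using Python's `re` module. The connector itself is Java, and whether it applies a partial or a full match is an assumption either way, so both are shown; the result is the same because the pattern requires a literal "/" after the host.

```python
import re

# Pattern copied from the "include in crawl" list quoted in this thread.
include_pattern = re.compile(r"http://www.uio.no/.*")

seed_without_slash = "http://www.uio.no"
seed_with_slash = "http://www.uio.no/"

# The bare host has no "/" after "no", so neither matching mode succeeds.
print(include_pattern.search(seed_without_slash))        # None
print(include_pattern.fullmatch(seed_without_slash))     # None
print(bool(include_pattern.fullmatch(seed_with_slash)))  # True
```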

The proposal to silently modify the seed according to some criteria
makes me nervous.  I'd much rather the UI caught and complained about
seeds that were non-conforming than have something silent happen under
the covers.

Karl


Possible bug in seeds list (web connector)

2012-03-15 Thread Erlend Garåsen


If I add the following URL into my seeds list:
http://www.uio.no
and this into the "include in crawl" list:
http://www.uio.no/.*
the job will just end shortly after it starts without fetching anything 
at all. If I add the missing trailing slash to the URL in my seeds list 
(http://www.uio.no/), it works as it should.


I also discovered another similar behaviour. If I add the following into 
my seeds list:
www.uio.no
select the "include only hosts matching seeds?" option and do not add 
anything into the "include in crawl" list, the same thing happens. No URLs 
will be fetched.


I suggest that we do something like this:
- A URL in the Java code will always start with "http(s)://www.myhost.com/"
- If you fail to add the protocol or the trailing slash, it will be 
added automatically instead of returning an error message.


By "in the Java code", I mean that it should automatically be formatted 
like this before we do a regular expression match.
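
The two rules above could be sketched as follows. This is Python rather than the connector's Java, and `normalize_seed` and its `default_scheme` parameter are hypothetical names used only to illustrate the suggestion; choosing "http" when no protocol is given is likewise an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_seed(seed, default_scheme="http"):
    """Add a missing protocol and a missing trailing slash after the host."""
    seed = seed.strip()
    if "://" not in seed:
        # Assumption: a seed without a protocol is meant as default_scheme.
        seed = f"{default_scheme}://{seed}"
    parts = urlsplit(seed)
    # Only an empty path gets a slash; paths such as /startpage.html are kept.
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(normalize_seed("www.uio.no"))         # http://www.uio.no/
print(normalize_seed("http://www.uio.no"))  # http://www.uio.no/
print(normalize_seed("http://hello.world.com/startpage.html"))
```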


Erlend
