Re: Excluding html files and following links

2011-06-21 Thread Erlend Garåsen


Sure, I can create a ticket. But first I want to discuss this issue with 
the two search consultants we have hired.


I decided to post to the dev list in order to get some feedback on this 
issue.


Erlend

On 20.06.11 18.00, Karl Wright wrote:

Hi Erlend,

The inclusions and exclusions are based solely on URL, and block the
connector from fetching the file.  Otherwise you would easily wind up
fetching the entire web.

However, this raises an interesting issue as to whether there's a way
in the web connector to do what you are trying to do, which is to
filter based on URL after links have been extracted.  The current
inclusions/exclusions work fine for any URLs without links but do not
allow for the case you are looking for.

Can you create a ticket?  The suggestion would be to introduce
post-extraction inclusions and exclusions into the connector.

Karl
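
The post-extraction filtering suggested above could be sketched roughly as follows. This is only an illustration, not ManifoldCF code: the in-memory PAGES graph, the crawl() function, and the fetch_patterns/index_patterns names are all assumptions made for the example. The point is that one set of patterns decides what gets fetched (and so which links are discovered), while a second set decides what gets indexed.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> outgoing links (stands in for real fetching).
PAGES = {
    "http://folk.uio.no/index.html": [
        "http://folk.uio.no/a.pdf",
        "http://folk.uio.no/sub/index.html",
        "http://example.org/other.html",
    ],
    "http://folk.uio.no/sub/index.html": ["http://folk.uio.no/sub/b.pdf"],
    "http://folk.uio.no/a.pdf": [],
    "http://folk.uio.no/sub/b.pdf": [],
    "http://example.org/other.html": ["http://example.org/c.pdf"],
}


def matches(url, patterns):
    """True if any regular expression in patterns is found in url."""
    return any(re.search(p, url) for p in patterns)


def crawl(seeds, fetch_patterns, index_patterns):
    """Fetch URLs allowed by fetch_patterns; index those allowed by index_patterns."""
    queue = deque(seeds)
    seen = set(seeds)
    indexed = []
    while queue:
        url = queue.popleft()
        if not matches(url, fetch_patterns):
            continue  # never fetched, so its links are never discovered
        links = PAGES.get(url, [])  # assumed stand-in for fetch + link extraction
        if matches(url, index_patterns):
            indexed.append(url)  # post-extraction filter: index only these
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


seeds = ["http://folk.uio.no/index.html"]
pdf_only = [r"^http://folk\.uio\.no/.*\.pdf$"]
whole_site = [r"^http://folk\.uio\.no/"]

# One filter for both stages: the HTML seed is blocked, so nothing is found.
single_stage = crawl(seeds, pdf_only, pdf_only)

# Separate stages: HTML pages are fetched for their links, only PDFs are indexed.
two_stage = crawl(seeds, whole_site, pdf_only)

print(single_stage)
print(two_stage)
```

Keeping the fetch filter restricted to the site preserves the safeguard against wandering off across the entire web, while the second filter narrows what actually ends up in the index.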


On Mon, Jun 20, 2011 at 10:53 AM, Erlend Garåsen wrote:


I just realized that if I exclude html files for a job, links in these files
will not be followed. Is this a desirable behaviour? Should links be
followed regardless of the exclude filter?

I discovered this issue when I was going to crawl only pdfs and realized
that the job ended without finding any documents at all. I think I had
something like this in my include list:
http://foreninger.uio.no/.*\.pdf$
http://folk.uio.no/.*\.pdf$

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
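
A small check of why the job in the original message found nothing, assuming the include patterns are applied as regular expressions searched against each URL (an assumption about the matching mode, not something stated in the thread). The two patterns are quoted from the message; the seed and PDF URLs are hypothetical.

```python
import re

# Include patterns quoted from the original message.
INCLUDE_PATTERNS = [
    r"http://foreninger.uio.no/.*\.pdf$",
    r"http://folk.uio.no/.*\.pdf$",
]


def is_included(url, patterns=INCLUDE_PATTERNS):
    """Return True if any include pattern is found in the URL."""
    return any(re.search(p, url) for p in patterns)


# Hypothetical URLs, for illustration only.
seed = "http://folk.uio.no/index.html"
pdf = "http://folk.uio.no/docs/report.pdf"

print(is_included(seed))  # False: the HTML seed page itself is not fetched
print(is_included(pdf))   # True: but nothing ever links the crawler to it
```

Because the HTML start page does not match either pattern, it is never fetched, and the PDF links it contains are never discovered.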






Re: Excluding html files and following links

2011-06-21 Thread Karl Wright
Sure.  But you've already convinced me we need a new feature. ;-)

Karl

On Tue, Jun 21, 2011 at 3:50 AM, Erlend Garåsen wrote:
>
> Sure, I can create a ticket. But first I want to discuss this issue with the
> two search consultants we have hired.
>
> I decided to post to the dev list in order to get some feedback on this
> issue.
>
> Erlend