Sure. But you've already convinced me we need a new feature. ;-)
Karl
On Tue, Jun 21, 2011 at 3:50 AM, Erlend Garåsen wrote:
>
> Sure, I can create a ticket. But first I want to discuss this issue with the
> two search consultants we have hired.
>
> I decided to post to the dev list in order to get some feedback on this
> issue.
>
> Erlend
>
> On 20.06.11 18.00, Karl Wright wrote:
>>
>> Hi Erlend,
>>
>> The inclusions and exclusions are based solely on URL, and block the
>> connector from fetching the file. Otherwise you would easily wind up
>> fetching the entire web.
>>
>> However, this raises an interesting issue as to whether there's a way
>> in the web connector to do what you are trying to do, which is to
>> filter based on URL after links have been extracted. The current
>> inclusions/exclusions work fine for documents that contain no links, but do
>> not allow for the case you are looking for.
>>
>> Can you create a ticket? The suggestion would be to introduce
>> post-extraction inclusions and exclusions into the connector.
>>
>> Karl
>>
>>
>> On Mon, Jun 20, 2011 at 10:53 AM, Erlend Garåsen wrote:
>>>
>>> I just realized that if I exclude html files for a job, links in these
>>> files will not be followed. Is this a desirable behaviour? Should links be
>>> followed regardless of the exclude filter?
>>>
>>> I discovered this issue when I was going to crawl only pdfs and realized
>>> that the job ended without finding any documents at all. I think I had
>>> something like this in my include list:
>>> http://foreninger.uio.no/.*\.pdf$
>>> http://folk.uio.no/.*\.pdf$
>>>
>>> Erlend
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
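
To make the behaviour discussed in this thread concrete, here is a minimal sketch in Python. It is not the web connector's code or API. The in-memory link graph, the `crawl` function, and the `fetch_include` / `index_include` names are all hypothetical, and the `SITE` pattern is an assumed example of a fetch-time inclusion. Only the two `.pdf` patterns are taken from the original message. The sketch contrasts a single URL filter that blocks fetching with a separate post-extraction filter that decides what gets indexed.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> outgoing links (illustration only).
FAKE_WEB = {
    "http://folk.uio.no/": [
        "http://folk.uio.no/a.html",
        "http://folk.uio.no/paper.pdf",
    ],
    "http://folk.uio.no/a.html": [
        "http://folk.uio.no/deep/thesis.pdf",
        "http://example.com/",
    ],
    "http://folk.uio.no/paper.pdf": [],
    "http://folk.uio.no/deep/thesis.pdf": [],
    "http://example.com/": ["http://example.com/other.pdf"],
    "http://example.com/other.pdf": [],
}

# The include patterns quoted in the original message.
PDF_ONLY = [
    re.compile(r"http://foreninger.uio.no/.*\.pdf$"),
    re.compile(r"http://folk.uio.no/.*\.pdf$"),
]

# Assumed fetch-time inclusion that keeps the crawl on one host.
SITE = [re.compile(r"http://folk\.uio\.no/.*")]


def matches(url, patterns):
    """Return True if any pattern matches from the start of the URL."""
    return any(p.match(url) for p in patterns)


def crawl(seeds, fetch_include, index_include):
    """Breadth-first crawl over FAKE_WEB; returns the sorted indexed URLs.

    fetch_include decides whether a URL is fetched at all (and therefore
    whether its links are extracted). index_include is applied after link
    extraction and only decides whether the document is indexed.
    """
    queue = deque(seeds)
    seen = set(seeds)
    indexed = []
    while queue:
        url = queue.popleft()
        if not matches(url, fetch_include):
            continue  # blocked before fetching, so its links are never seen
        links = FAKE_WEB.get(url, [])  # stands in for fetch + link extraction
        if matches(url, index_include):
            indexed.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(indexed)


seed = ["http://folk.uio.no/"]

# One filter for both steps: the HTML seed is blocked, nothing is found.
print(crawl(seed, PDF_ONLY, PDF_ONLY))

# Separate filters: HTML on the host is followed, only PDFs are indexed.
print(crawl(seed, SITE, PDF_ONLY))
```

With the same PDF-only patterns used for both steps, the HTML seed page is never fetched and the job ends empty, which matches what Erlend observed. With a host-limited fetch filter and a PDF-only post-extraction filter, the HTML pages are followed but only the PDFs are indexed, and the crawl still does not leave the host.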