[jira] [Commented] (CONNECTORS-1573) Web Crawler exclude from index matches too much?

Karl Wright (JIRA) Thu, 24 Jan 2019 15:15:26 -0800


    [ 
https://issues.apache.org/jira/browse/CONNECTORS-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751689#comment-16751689
 ]


Karl Wright commented on CONNECTORS-1573:
-----------------------------------------

Questions like this should be asked on the us...@manifoldcf.apache.org list, 
not via a ticket.

The quick answer: if you look at the Simple History, you can tell whether the
pages are fetched or not.  If they are not fetched at all (that is, they do not
appear), then your inclusion and exclusion lists are wrong.  That doesn't sound
like the problem here; it sounds like the pages are being blocked *after*
fetching.  There are a number of possible reasons for that; the Simple History
should give you a good idea which one applies.  If it reports "JOBDESCRIPTION",
that means that the *indexing* inclusion/exclusion rules discarded the page.
This is not the same as the *fetching* inclusion/exclusion rules, which is what
it sounds like you might be setting.  The indexing rules are on the same tabs,
just farther down.  The manual does not cover the indexing rules sections; this
should be addressed.
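
To illustrate the difference between the two rule sets, here is a rough
sketch in Python.  It is only a conceptual model, not ManifoldCF code: the
function names, the argument names, and the returned labels are all
hypothetical, and the real connector is written in Java.  It shows how a URL
can pass the fetching rules and still be discarded by the indexing rules.

```python
import re

def matches_any(url, patterns):
    # A pattern counts as a match if it is found anywhere in the URL.
    return any(re.search(p, url) for p in patterns)

def classify(url, fetch_include, fetch_exclude, index_include, index_exclude):
    # Hypothetical two-stage model: fetching rules first, then indexing rules.
    if not matches_any(url, fetch_include) or matches_any(url, fetch_exclude):
        return "not fetched"
    if not matches_any(url, index_include) or matches_any(url, index_exclude):
        return "fetched, not indexed"
    return "indexed"

# Rules resembling the ones in the report, with the exclusion placed in the
# indexing rule set (an assumption about how the job was configured).
child = "http://www.website.com/nl/products/page.html"
result = classify(
    child,
    fetch_include=[r".*"],
    fetch_exclude=[],
    index_include=[r".*"],
    index_exclude=[r"http://www.website.com/nl/"],
)
print(result)
```

In this model the child page is fetched but then dropped at the indexing
stage, which is the symptom described in the report.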

I suspect that, based on the regexps you've given, you're also overlooking the
fact that if the regexp matches ANYWHERE in the URL it is considered a match.
So if you want to match one very specific URL, you need to delimit the regexp
with ^ at the beginning and $ at the end, to ensure that the entire URL matches
and ONLY that URL.
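
A small sketch of the anchoring point, written in Python for brevity.  The
helper name is_excluded and the child URL are made up for illustration, and
ManifoldCF itself uses Java regular expressions, although for these simple
patterns the "found anywhere" versus "whole URL" behavior should be the same.

```python
import re

def is_excluded(url, patterns):
    # "Matches anywhere" semantics: search, not a full-string match.
    return any(re.search(p, url) is not None for p in patterns)

parent = "http://www.website.com/nl/"
child = "http://www.website.com/nl/products/page.html"  # hypothetical child page

unanchored = [r"http://www.website.com/nl/"]
with_optional_space = [r"http://www.website.com/nl/(\s)*"]
anchored = [r"^http://www\.website\.com/nl/$"]

# The unanchored pattern is found inside the child URL, so the child is
# excluded along with the parent.
print(is_excluded(parent, unanchored), is_excluded(child, unanchored))

# (\s)* can match zero characters, so this variant behaves the same way.
print(is_excluded(child, with_optional_space))

# With ^ and $ only the parent URL itself matches.
print(is_excluded(parent, anchored), is_excluded(child, anchored))
```

Escaping the dots in the anchored pattern is optional for this case, but it
keeps "." from matching arbitrary characters.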




> Web Crawler exclude from index matches too much?
> ------------------------------------------------
>
>                 Key: CONNECTORS-1573
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1573
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Korneel Staelens
>            Priority: Major
>
> Hello, 
> I'm not sure this is a bug, or my misinterpretation of the exclusion rules:
> I want to set-up a rule, so that it does NOT index a parentpage, but does 
> index all childpages of that parent:
> I'm setting up a rule: 
> Inclusions: 
> .*
>  
> Exclusions:
> [http://www.website.com/nl/]
> (I've tried also: http://www.website.com/nl/(\s)* )
> No dice. If I look at the logs, I see the pages are crawled, but not 
> indexed due to job restriction. Is my rule wrong? Or is this a small bug?
>  
> Thanks for advice!
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
