[ https://issues.apache.org/jira/browse/CONNECTORS-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751689#comment-16751689 ]
Karl Wright commented on CONNECTORS-1573:
-----------------------------------------

Questions like this should be asked on the us...@manifoldcf.apache.org list, not via a ticket.

The quick answer: if you look at the Simple History, you can tell whether the pages are fetched or not. If they are not fetched at all (that is, they do not appear), then your inclusion and exclusion list is wrong. That doesn't sound like the problem here; it sounds like the document is being blocked *after* fetching. There are a number of reasons for that; the Simple History should give you a good idea which one it is. If it reports "JOBDESCRIPTION", that means that the *indexing* inclusion/exclusion rule discarded it. This is not the same as the *fetching* inclusion/exclusion rules, which is what it sounds like you might be setting. They're on the same tabs, just farther down. The manual does not include the indexing rules sections; this should be addressed.

I suspect that, based on the regexps you've given, you're also overlooking the fact that if the regexp matches ANYWHERE in the URL it is considered a match. So if you want a very specific URL, you need to delimit it with ^ at the beginning and $ at the end, to ensure that the entire URL matches and ONLY that URL.

> Web Crawler exclude from index matches too much?
> ------------------------------------------------
>
>                 Key: CONNECTORS-1573
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1573
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Korneel Staelens
>            Priority: Major
>
> Hello,
> I'm not sure whether this is a bug or my misinterpretation of the exclusion rules.
> I want to set up a rule so that it does NOT index a parent page, but does
> index all child pages of that parent.
> I'm setting up a rule:
> Inclusions:
> .*
>
> Exclusions:
> [http://www.website.com/nl/]
> (I've also tried: http://www.website.com/nl/(\s)* )
> No dice. If I look at the logs, I see the pages are crawled, but not
> indexed due to job restriction. Is my rule wrong? Or is this a small bug?
>
> Thanks for advice!
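
As a rough illustration of the anchoring point made in the comment, here is a minimal Java sketch. It assumes the exclusion check behaves like java.util.regex's Matcher.find(), that is, a match anywhere in the URL counts, as the comment describes. The class name RegexAnchorDemo, the helper isExcluded, and the child URL are hypothetical and are not part of ManifoldCF.

```java
import java.util.regex.Pattern;

// Illustrative only: not ManifoldCF code.
class RegexAnchorDemo {
    // Hypothetical helper: a URL counts as excluded if the regexp
    // matches ANYWHERE inside it (find() semantics, assumed here).
    static boolean isExcluded(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        // The rule from the ticket: no anchors, so it also matches
        // every URL that merely contains the parent URL.
        String unanchored = "http://www.website.com/nl/";
        // Anchored with ^ and $, dots escaped: matches only the parent URL.
        String anchored = "^http://www\\.website\\.com/nl/$";

        String[] urls = {
            "http://www.website.com/nl/",           // parent page
            "http://www.website.com/nl/child.html"  // hypothetical child page
        };
        for (String url : urls) {
            System.out.println(url
                + " unanchored=" + isExcluded(unanchored, url)
                + " anchored=" + isExcluded(anchored, url));
        }
    }
}
```

Under that assumption, the unanchored rule excludes both the parent and the child page, while the anchored rule excludes only the parent.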