[ https://issues.apache.org/jira/browse/CONNECTORS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530348#comment-14530348 ]
Karl Wright commented on CONNECTORS-1193:
-----------------------------------------

bq. This could be inserted at any stage in the pipeline meaning that if this filter is inserted after html->text

That only makes sense if you aren't adding the feature to the web connector after all, but rather to a general content filter transformation connector. A general content filter transformation connector would be able to work in multiple ways, yes -- and with noted performance loss -- but it would appear to me that this functionality is primarily applicable to web crawling. Even the RSS connector does not seem to require this kind of filtering.

If you accept this reasoning and want to do this functionality in the web connector itself, then we would probably make it work much like the content match feature for session authentication, which matches only actual content, since it parses any HTML tags. There would be no ability, therefore, to filter binary documents based on their contents, or filter documents based on their tag structure.

> Consider adding feature to web connector to skip pages that match specified criteria
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1193
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1193
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.10, ManifoldCF 2.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
>
> The user wants to skip content that matches specified criteria, because some
> sites don't return a 404 code (for instance) but instead return 200 with a
> textual error message.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
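
For illustration only, here is a rough Java sketch of the behaviour discussed in the comment above: match the skip criterion against the page's actual text content with HTML tags removed, so that a page returned with a 200 status but a textual error message can be skipped. This is not ManifoldCF code. The class name ContentSkipSketch and the methods visibleText and shouldSkip are hypothetical, and the regex-based tag stripping is a simplifying assumption standing in for a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from ManifoldCF.
class ContentSkipSketch {
    // Assumption: a crude regex stripper stands in for real HTML parsing.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    // Returns the text content of the page with tags removed and whitespace collapsed.
    public static String visibleText(String html) {
        String withoutBlocks = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutBlocks).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    // True if the criterion matches the text content; tags and attributes are never matched.
    public static boolean shouldSkip(String html, Pattern criterion) {
        return criterion.matcher(visibleText(html)).find();
    }

    public static void main(String[] args) {
        Pattern notFound = Pattern.compile("page not found", Pattern.CASE_INSENSITIVE);

        // A "soft 404": status 200, but the body is an error message.
        String softError = "<html><body><h1>Page <b>Not</b> Found</h1></body></html>";
        // The phrase appears only inside an attribute, not in the content.
        String tagOnly = "<html><body><div class=\"page not found\">Welcome</div></body></html>";

        System.out.println(shouldSkip(softError, notFound)); // prints "true"
        System.out.println(shouldSkip(tagOnly, notFound));   // prints "false"
    }
}
```

The second case shows the limitation noted in the comment: because only text content is matched, a criterion cannot select documents by their tag structure, and binary documents are out of scope entirely.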