[ 
https://issues.apache.org/jira/browse/CONNECTORS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530348#comment-14530348
 ] 

Karl Wright commented on CONNECTORS-1193:
-----------------------------------------

bq. This could be inserted at any stage in the pipeline meaning that if this 
filter is inserted after html->text

That only makes sense if you aren't adding the feature to the web connector 
after all, but rather to a general content filter transformation connector.  A 
general content filter transformation connector could indeed work in multiple 
ways, though at a noticeable performance cost.  It appears to me, however, that 
this functionality is primarily applicable to web crawling.  Even the RSS 
connector does not seem to require this kind of filtering.

If you accept this reasoning and want to implement this functionality in the 
web connector itself, then we would probably make it work much like the content 
match feature for session authentication.  That feature parses the HTML and 
matches only against the actual text content, not the tags.  There would 
therefore be no ability to filter binary documents based on their contents, or 
to filter documents based on their tag structure.
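
A minimal sketch of what such a text-only match could look like, under stated 
assumptions: the class name SoftErrorFilter and its methods are hypothetical 
and are not part of the ManifoldCF web connector API, and the regex-based tag 
stripping is only a stand-in for the connector's real HTML parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not ManifoldCF code. */
final class SoftErrorFilter {
    // Crude stand-ins for a real HTML parser: drop script/style bodies, then all tags.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private final List<Pattern> skipPatterns;

    SoftErrorFilter(List<Pattern> skipPatterns) {
        // Defensive copy so later changes to the caller's list have no effect.
        this.skipPatterns = new ArrayList<>(skipPatterns);
    }

    /** Returns only the text between tags, with runs of whitespace collapsed. */
    static String visibleText(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String noTags = TAG.matcher(noScripts).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    /** True if any skip pattern is found in the page's text content. */
    boolean shouldSkip(String html) {
        String text = visibleText(html);
        for (Pattern p : skipPatterns) {
            if (p.matcher(text).find()) {
                return true;
            }
        }
        return false;
    }
}
```

With a single pattern such as (?i)page not found, a page returned with status 
200 whose body text reads "Page not found" would be skipped, while a page that 
carries the same phrase only inside a tag attribute would not, because 
attributes and tag structure never reach the matcher.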





> Consider adding feature to web connector to skip pages that match specified 
> criteria
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1193
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1193
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.10, ManifoldCF 2.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
>
> The user wants to skip content that matches specified criteria, because some 
> sites don't return a 404 code (for instance) but instead return 200 with a 
> textual error message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
