[ https://issues.apache.org/jira/browse/CONNECTORS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553705#comment-14553705 ]
Karl Wright commented on CONNECTORS-1193:
-----------------------------------------

Hi Arcadius,

(1) The UI portions of the patch look good to me.

(2) For the actual processing, it looks to me like you are loading the entire extracted content for the page into memory. That's never going to work. Whether you use Tika to extract the content or just fuzzyml, you will have to use various tricks to look for the content in a stream rather than in a giant string. There is other code already in the web connector that does this; you might want to model your code on it.

(3) I think involving Tika or fuzzyml in every web fetch decision as a matter of course is also a non-starter. It would probably reduce the performance of the web connector by an order of magnitude. In general, I would greatly prefer that if the user has specified no content to be excluded, then no extra parsing work happens.

(4) Using Tika, and thus dealing with all kinds of binary content, is probably also not going to work, for performance reasons. People crawl *very* large binary documents. Web documents are typically limited in size because they need to be displayed in a browser. You could fix this in one of two ways: either only look at HTML content with fuzzyml (which would cover your initial use case completely), or limit the total characters examined in every document to some maximum number you set as part of the document specification. I don't think you've made a compelling case for using Tika yet, though.

As for integration testing, you have two possibilities. The first is simply to count documents. That does not guarantee that the correct one(s) were excluded, but it's usually reasonable to assume it if the cardinality is what you would expect. The second is to get more detailed by looking at the simple history report, which you can run via the Java API within your test. This should give you a precise idea of what was included and what was rejected.

Thanks!
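To make points (2) through (4) concrete, here is a minimal sketch of scanning a character stream for exclusion phrases without holding the whole page in memory. It is an illustration only, not the code already in the web connector and not the patch under review: the class `StreamingExclusionCheck`, the method `containsAny`, and the `maxChars` parameter are hypothetical names, and it matches plain substrings rather than whatever matching rule the patch actually uses.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;

// Hypothetical helper for illustration; not part of ManifoldCF.
public class StreamingExclusionCheck {

    /**
     * Returns true if any phrase occurs within the first maxChars
     * characters supplied by the reader.
     */
    public static boolean containsAny(Reader reader, List<String> phrases, long maxChars)
            throws IOException {
        // Point (3): nothing to exclude means no parsing work at all.
        if (phrases.isEmpty()) {
            return false;
        }
        int longest = 0;
        for (String p : phrases) {
            longest = Math.max(longest, p.length());
        }
        // A match can straddle two chunks, so keep this many trailing chars.
        int overlap = Math.max(longest - 1, 0);

        char[] chunk = new char[4096];
        StringBuilder window = new StringBuilder();
        long remaining = maxChars; // point (4): cap the characters examined
        while (remaining > 0) {
            int toRead = (int) Math.min(chunk.length, remaining);
            int n = reader.read(chunk, 0, toRead);
            if (n < 0) {
                break; // end of stream
            }
            remaining -= n;
            window.append(chunk, 0, n);
            for (String p : phrases) {
                if (!p.isEmpty() && window.indexOf(p) >= 0) {
                    return true;
                }
            }
            // Point (2): discard everything except the tail that could
            // still begin a match completed by the next chunk.
            if (window.length() > overlap) {
                window.delete(0, window.length() - overlap);
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        List<String> phrases = List.of("Page not found");
        Reader page = new StringReader("<html><body>Page not found</body></html>");
        System.out.println(containsAny(page, phrases, 100_000L));
    }
}
```

Memory use stays bounded by the chunk size plus the longest phrase, regardless of document size, and a document with no configured phrases is never read at all.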
> Consider adding feature to web connector to skip pages that match specified criteria
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1193
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1193
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.10, ManifoldCF 2.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
>         Attachments: CONNECTORS-1193.patch
>
>
> The user wants to skip content that matches specified criteria, because some
> sites don't return a 404 code (for instance) but instead return 200 with a
> textual error message.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)