[ 
https://issues.apache.org/jira/browse/CONNECTORS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553705#comment-14553705
 ] 

Karl Wright commented on CONNECTORS-1193:
-----------------------------------------

Hi Arcadius,

(1) The UI portions of the patch look good to me.
(2) For the actual processing, it looks to me like you are loading the entire
extracted content for the page into memory.  That's never going to work for
large documents.  Whether you use Tika to extract the content or just fuzzyml,
you will have to look for the content in a stream rather than in a giant
string.  There is other code already in the web connector that does this; you
might want to model your code on it.
(3) I think involving Tika or fuzzyml in every web fetch decision as a matter 
of course is also a non-starter.  It would probably reduce the performance of 
the web connector by an order of magnitude.  In general, I would greatly prefer 
that if the user has specified no content to be excluded, then no extra parsing 
work happens.
(4) Using Tika, and thus dealing with all kinds of binary content, is probably
also not going to work, for performance reasons.  People crawl *very* large
binary documents.  HTML pages, by contrast, are typically limited in size
because they need to be displayed in a browser.  You could fix this in one of
two ways: either only look at HTML content with fuzzyml (which would cover your
initial use case completely), or limit the total characters examined in every
document to some maximum number you set as part of the document specification.
I don't think you've made a compelling case for using Tika yet, though.
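
To make points (2) through (4) concrete, here is a minimal sketch of the kind
of bounded scan I have in mind.  It is not the existing web connector code and
it does not use fuzzyml; the class name, method name, and the maxChars
parameter are all made up for illustration, and it matches plain phrases
against the decoded character stream rather than against extracted text.  It
reads fixed-size chunks, carries over just enough characters to catch a phrase
that straddles two chunks, stops after maxChars characters, and does no
reading at all when there is nothing to exclude.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.List;

/** Hypothetical helper, for illustration only; not part of ManifoldCF. */
public final class StreamingExclusionSketch {
    private StreamingExclusionSketch() {}

    /**
     * Returns true if any non-empty phrase occurs within the first maxChars
     * characters of the stream. Memory use is bounded by the buffer size,
     * never by the size of the document.
     */
    public static boolean containsAny(Reader content, List<String> phrases, long maxChars)
            throws IOException {
        int longest = 0;
        for (String p : phrases) {
            longest = Math.max(longest, p.length());
        }
        // Point (3): nothing to exclude means no extra work, not even a read.
        if (longest == 0) {
            return false;
        }

        // Keep the last (longest - 1) characters of each chunk so that a
        // phrase split across two reads is still seen in one window.
        int overlap = longest - 1;
        char[] buf = new char[Math.max(8192, 2 * longest)];
        int carried = 0;   // characters kept from the previous chunk
        long consumed = 0; // characters read from the stream so far

        // Point (4): never look past maxChars characters.
        while (consumed < maxChars) {
            int want = (int) Math.min(buf.length - carried, maxChars - consumed);
            int n = content.read(buf, carried, want);
            if (n < 0) {
                break; // end of stream
            }
            consumed += n;
            int filled = carried + n;

            String window = new String(buf, 0, filled);
            for (String p : phrases) {
                if (!p.isEmpty() && window.contains(p)) {
                    return true;
                }
            }

            // Slide the tail of the window to the front of the buffer.
            carried = Math.min(overlap, filled);
            System.arraycopy(buf, filled - carried, buf, 0, carried);
        }
        return false;
    }
}
```

A caller would wrap the fetched response body in a Reader with the right
charset and call containsAny(reader, phrases, limit) only for HTML responses.
Regular expressions would need a different approach than this literal-phrase
window, so treat the above as a starting point only.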

As for integration testing, you have two possibilities.  The first is simply to
count documents.  That does not guarantee that the correct one(s) were
excluded, but it's usually reasonable to assume so if the count is what you
would expect.  The second is to get more detailed by looking at the simple
history report, which you can run via the Java API within your test.  This
should give you a precise idea of what was included and what was rejected.

Thanks!




> Consider adding feature to web connector to skip pages that match specified 
> criteria
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1193
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1193
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.10, ManifoldCF 2.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
>         Attachments: CONNECTORS-1193.patch
>
>
> The user wants to skip content that matches specified criteria, because some 
> sites don't return a 404 code (for instance) but instead return 200 with a 
> textual error message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
