[ https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wright reassigned CONNECTORS-1547:
---------------------------------------

    Assignee: Karl Wright

> No activity record for excluded documents in WebCrawlerConnector
> ----------------------------------------------------------------
>
>                 Key: CONNECTORS-1547
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>            Reporter: Olivier Tavard
>            Assignee: Karl Wright
>            Priority: Minor
>         Attachments: manifoldcf_local_files.log, manifoldcf_web.log,
>                      simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that no activity record is logged for documents excluded by
> the Document Filter transformation connector when it is used with the
> WebCrawler connector.
> To reproduce the issue on an out-of-the-box MCF:
> Null output connector
> Web repository connector
> Job:
> - DocumentFilter added, which only accepts application/msword (doc/docx)
> documents
> The simple history does not mention the excluded documents (except for HTML
> documents). They have a fetch activity and nothing else (see
> simple_history_web.jpg).
> The excluded documents are visible only in the MCF log (with DEBUG verbosity
> enabled for connectors):
> {code:java}
> Removing url 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' because it had the wrong content type ('image/png')
> {code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java, l. 904:
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type ('"+contentType+"')";
> fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
> activityResultCode = null;
> {code}
> The activityResultCode is null, so no activity record appears to be logged
> for this case.
>
>
> If we configure the same job with a Local File system connector and the
> same Document Filter transformation connector, the simple history lists
> all the excluded documents (see simple_history_files.jpg), and the code
> sets a specific error code with an activity record logged (class
> FileConnector, l. 415):
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
> {
>   errorCode = activities.EXCLUDED_MIMETYPE;
>   errorDesc = "Excluded because mime type ('"+mimeType+"')";
>   Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because mime type ('"+mimeType+"') was excluded by output connector.");
>   activities.noDocument(documentIdentifier,versionString);
>   continue;
> }
> {code}
>
> So I think the Web Crawler connector should behave the same way as
> FileConnector and explicitly record all the documents excluded by the user.
>
> Best regards,
> Olivier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)