[ 
https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1547:
---------------------------------------

    Assignee: Karl Wright

> No activity record for excluded documents in WebCrawlerConnector
> ----------------------------------------------------------------
>
>                 Key: CONNECTORS-1547
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>            Reporter: Olivier Tavard
>            Assignee: Karl Wright
>            Priority: Minor
>         Attachments: manifoldcf_local_files.log, manifoldcf_web.log, 
> simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that no activity record is logged for documents excluded by 
> the Document Filter transformation connector in the WebCrawler connector.
> To reproduce the issue on an out-of-the-box MCF installation:
> Null output connector
> Web repository connector
> Job:
> - DocumentFilter added which only accepts application/msword (doc/docx) 
> documents
> The simple history does not mention the excluded documents (except for HTML 
> documents). They have a fetch activity and nothing else (see 
> simple_history_web.jpg).
> The excluded documents are visible only in the MCF log (with DEBUG verbosity 
> enabled for connector activity):
> {code:java}
> Removing url 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' because it had the wrong content type ('image/png'){code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java, line 904:
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type ('"+contentType+"')";
>  fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
>  activityResultCode = null;{code}
> The activityResultCode is null, so no activity record appears to be logged 
> for the excluded document.
>  
>  
> If we configure the same job with a Local File system connector and the 
> same Document Filter transformation connector, the simple history lists 
> all the excluded documents (see 
> simple_history_files.jpg), and the code sets a specific error code and 
> logs an activity record (class FileConnector, line 415):
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
>  {
>  errorCode = activities.EXCLUDED_MIMETYPE;
>  errorDesc = "Excluded because mime type ('"+mimeType+"')";
>  Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because mime type ('"+mimeType+"') was excluded by output connector.");
>  activities.noDocument(documentIdentifier,versionString);
>  continue;
>  }{code}
>  
> So I think the Web Crawler connector should behave the same way as 
> FileConnector and explicitly record every document excluded by the user's 
> filter.
>  
> Best regards,
> Olivier
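
The behaviour requested above can be illustrated with a minimal, self-contained sketch. It does not use the ManifoldCF API: `ExclusionActivitySketch`, `ActivityRecord`, `process`, `isIndexable`, the example.com URLs and the string value of `EXCLUDED_MIMETYPE` are all hypothetical stand-ins. The only idea taken from the report is that the wrong-content-type branch sets a non-null result code, so that a history record gets written for the excluded document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ExclusionActivitySketch {
    // Made-up value; stands in for the activities.EXCLUDED_MIMETYPE constant quoted in the report.
    static final String EXCLUDED_MIMETYPE = "EXCLUDEDMIMETYPE";

    // Hypothetical simple-history entry: one per recorded activity.
    static final class ActivityRecord {
        final String url;
        final String resultCode;
        final String description;

        ActivityRecord(String url, String resultCode, String description) {
            this.url = url;
            this.resultCode = resultCode;
            this.description = description;
        }
    }

    // Stand-in for the job's simple history.
    final List<ActivityRecord> history = new ArrayList<>();

    // Hypothetical stand-in for the mime-type check done by the document filter.
    static boolean isIndexable(Set<String> acceptedTypes, String contentType) {
        return acceptedTypes.contains(contentType);
    }

    // Returns the result code recorded for the document, or null if nothing was recorded.
    String process(String url, String contentType, Set<String> acceptedTypes) {
        String activityResultCode = null;
        String contextMessage = null;

        if (!isIndexable(acceptedTypes, contentType)) {
            contextMessage = "it had the wrong content type ('" + contentType + "')";
            // The reported code leaves this null; the sketch sets an explicit code instead.
            activityResultCode = EXCLUDED_MIMETYPE;
        }

        // Assumption: a record is only written when a result code is present.
        if (activityResultCode != null) {
            history.add(new ActivityRecord(url, activityResultCode, contextMessage));
        }
        return activityResultCode;
    }

    public static void main(String[] args) {
        ExclusionActivitySketch sketch = new ExclusionActivitySketch();
        Set<String> accepted = new HashSet<>(Arrays.asList("application/msword"));
        sketch.process("https://example.com/logo.png", "image/png", accepted);
        sketch.process("https://example.com/report.doc", "application/msword", accepted);
        for (ActivityRecord r : sketch.history) {
            System.out.println(r.url + " " + r.resultCode + " " + r.description);
        }
    }
}
```

In this sketch only the PNG produces a history entry, since the accepted document takes the normal path, which is not modelled here. How the real connector should pick the code and call its history API is left to the actual fix.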



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
