[jira] [Commented] (CONNECTORS-1118) Documents processed by the shared drive connector incur an unnecessary synchronisation hit

Karl Wright (JIRA) Tue, 09 Dec 2014 08:05:03 -0800

    [ https://issues.apache.org/jira/browse/CONNECTORS-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239576#comment-14239576 ]


Karl Wright commented on CONNECTORS-1118:
-----------------------------------------

Hi Aeham,

Both the trunk and dev_1x branches should already have a fix for the second problem:

{code}
        rt.clearPreloadRequests();
        for (int j = 0; j < docidHashes.length; j++)
        {
          DocumentReference dr = set.get(j);
          docidHashes[j] = dr.getLocalIdentifierHash();
          docids[j] = dr.getLocalIdentifier();
          dataNames[j] = dr.getDataNames();
          dataValues[j] = dr.getDataValues();
          eventNames[j] = dr.getPrerequisiteEventNames();

          // Calculate desired document priority based on current queuetracker status.
          String[] bins = ManifoldCF.calculateBins(connector,dr.getLocalIdentifier());
          PriorityCalculator p = new PriorityCalculator(rt,connection,bins);
          priorities[j] = p;
          p.makePreloadRequest();
        }
        rt.preloadBinValues();
{code}
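
To illustrate why the clear / request / preload sequence above helps, here is a minimal, self-contained sketch. It assumes that each uncached bin lookup costs one trip to a shared lock, and that a preload can serve every requested bin in a single trip. BinTracker and PreloadDemo are hypothetical stand-ins written for this example; they are not the ManifoldCF classes and their methods only mirror the names used in the snippet.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a tracker whose reads need a costly shared lock.
class BinTracker {
    private final Set<String> requested = new LinkedHashSet<>();
    private final Map<String, Double> cache = new HashMap<>();
    int lockAcquisitions = 0; // counts simulated trips to the shared lock

    void clearPreloadRequests() {
        requested.clear();
        cache.clear();
    }

    void addPreloadRequest(String bin) {
        requested.add(bin);
    }

    // One lock acquisition loads every requested bin.
    void preloadBinValues() {
        lockAcquisitions++;
        for (String bin : requested) {
            cache.put(bin, load(bin));
        }
    }

    // Served from the cache when preloaded; otherwise pays for its own lock.
    double getBinValue(String bin) {
        Double cached = cache.get(bin);
        if (cached != null) {
            return cached;
        }
        lockAcquisitions++;
        return load(bin);
    }

    // Placeholder value; a real tracker would read shared state here.
    private double load(String bin) {
        return bin.length();
    }
}

class PreloadDemo {
    // Register all bins first, preload once, then read.
    static int lockTripsWithPreload(String[] bins) {
        BinTracker tracker = new BinTracker();
        tracker.clearPreloadRequests();
        for (String bin : bins) {
            tracker.addPreloadRequest(bin);
        }
        tracker.preloadBinValues();
        for (String bin : bins) {
            tracker.getBinValue(bin);
        }
        return tracker.lockAcquisitions;
    }

    // Read each bin on demand, paying for the lock every time.
    static int lockTripsWithoutPreload(String[] bins) {
        BinTracker tracker = new BinTracker();
        for (String bin : bins) {
            tracker.getBinValue(bin);
        }
        return tracker.lockAcquisitions;
    }

    public static void main(String[] args) {
        String[] bins = {"serverA", "serverB", "serverC"};
        System.out.println("with preload: " + lockTripsWithPreload(bins));
        System.out.println("without preload: " + lockTripsWithoutPreload(bins));
    }
}
```

In this toy model the preloaded path takes the lock once per batch of document references, while the on-demand path takes it once per reference.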

As for the first issue, it is not trivial to fix without changing the entire 
IIncrementalIngester API.  I'll have to consider how best to deal with that.
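
For reference, the general idea behind avoiding the repeated construction is to build the expensive object once and reuse it across both checks. The sketch below is purely illustrative and is not a proposed patch: ExpensiveConnections, Lazy and IncludeChecker are hypothetical names, the acceptance rules are invented, and it sidesteps the real difficulty, which is how to carry such an object through the existing API.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for an object that is expensive to construct.
class ExpensiveConnections {
    static int constructions = 0; // counts how often the costly constructor runs
    final long maxLength;

    ExpensiveConnections(long maxLength) {
        constructions++;
        this.maxLength = maxLength;
    }

    // Invented rule for the example: only text types are indexable.
    boolean acceptsMimeType(String mimeType) {
        return mimeType.startsWith("text/");
    }

    boolean acceptsLength(long length) {
        return length <= maxLength;
    }
}

// Builds the wrapped value on first use and reuses it afterwards.
// Not thread-safe; intended for use by a single worker thread.
class Lazy<T> {
    private final Supplier<T> factory;
    private T value;

    Lazy(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        if (value == null) {
            value = factory.get();
        }
        return value;
    }
}

class IncludeChecker {
    private final Lazy<ExpensiveConnections> connections =
        new Lazy<>(() -> new ExpensiveConnections(1_000_000L));

    // Both checks share the same lazily built instance.
    boolean checkInclude(String mimeType, long length) {
        ExpensiveConnections c = connections.get();
        return c.acceptsMimeType(mimeType) && c.acceptsLength(length);
    }
}
```

A cached instance like this would also need to be discarded when the underlying connection configuration changes, which the sketch does not attempt.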

> Documents processed by the shared drive connector incur an unnecessary 
> synchronisation hit
> ------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1118
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1118
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework core
>    Affects Versions: ManifoldCF 1.7.2
>            Reporter: Aeham Abushwashi
>
> Each document processed by the shared drive connector is passed through 
> SharedDriveConnector#checkInclude to verify whether the document is eligible 
> for ingestion. The calls made here to 
> WorkerThread$ProcessActivity#checkMimeTypeIndexable and 
> WorkerThread$ProcessActivity#checkLengthIndexable are unnecessarily costly as 
> they each create a fresh instance of IncrementalIngester$PipelineConnections 
> on every call. The constructor of IncrementalIngester$PipelineConnections can 
> be very expensive due to the loading of output connection objects, which in 
> turn requires some locking (via ZK - in a distributed environment).
> The other area of inefficiency is in 
> WorkerThread$ProcessActivity#processDocumentReferences. This method creates 
> new instances of PriorityCalculator using the less-efficient 3-arg 
> constructor. This can be addressed using the same pattern implemented for 
> CONNECTORS-1094.
> To highlight the impact of the above calls, I profiled an active worker 
> thread for 40 minutes. During that window, it spent ~23 minutes in 
> SharedDriveConnector#checkInclude and its callees + 9 minutes creating 
> instances of PriorityCalculator.
> I've seen the above issues when using the shared drive connector but I think 
> other connectors too could be impacted - depending on how they're implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
