[jira] [Commented] (CONNECTORS-1249) Keep a separate document priority queue per job, and synchronize with any running jobs on job start

Karl Wright (JIRA) Fri, 04 Dec 2015 22:54:27 -0800

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042714#comment-15042714
 ]


Karl Wright commented on CONNECTORS-1249:
-----------------------------------------

Started (finally) looking at this.

Prioritization is done by bin.  Bins are named, and those names are currently 
global.  So all connections that create documents with the same bin name will 
collide with one another.

Bins are tracked by BinManager.java, with these methods:

{code}
public double[] getIncrementBinValues(String binName, double newBinValue, int count)
{code}

and

{code}
public double[] getIncrementBinValuesInTransaction(String binName, double newBinValue, int count)
{code}

The bin name is limited to 255 characters.

So two actions are required to address this:
(1) We augment the connection's provided bin name with additional information, 
such as job ID, connector name, etc;
(2) We make sure that all connectors provide a reasonable bin name that is 
NOT likely to collide from job to job, e.g. the host name of the connection.

For (1), using the job ID is problematic, because bin-based throttling is 
supposed to prevent specific machines/services from being overwhelmed.  But we 
could use the connector class name as a distinguishing factor, adding that 
field to the BinManager as a way of at least segregating documents by service.
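
A minimal sketch of what (1) could look like if the connector class name is 
simply folded into the bin name before it reaches BinManager.  Everything 
here is hypothetical: the QualifiedBinName class, the qualify() method, the 
":" separator, the hash-suffix fallback for names over the 255-character 
limit, and the org.example connector class names are illustrative 
assumptions, not existing ManifoldCF code.

```java
public final class QualifiedBinName {
    // Limit taken from the comment above: bin names are at most 255 characters.
    static final int MAX_BIN_NAME_LENGTH = 255;

    private QualifiedBinName() {}

    // Hypothetical helper: prefix the connector-provided bin name with the
    // connector class name, so equal bin names from different connector
    // classes no longer map to the same bin.
    public static String qualify(String connectorClassName, String binName) {
        String full = connectorClassName + ":" + binName;
        if (full.length() <= MAX_BIN_NAME_LENGTH) {
            return full;
        }
        // Too long: keep a prefix and append a hash of the whole string, so
        // two different long names are unlikely to shorten to the same value.
        String suffix = "#" + Integer.toHexString(full.hashCode());
        return full.substring(0, MAX_BIN_NAME_LENGTH - suffix.length()) + suffix;
    }

    public static void main(String[] args) {
        // Same host-based bin name, two made-up connector classes.
        String web = qualify("org.example.WebConnector", "server1.example.com");
        String share = qualify("org.example.ShareConnector", "server1.example.com");
        System.out.println(web);
        System.out.println(share);
        System.out.println(web.equals(share));
    }
}
```

Two jobs that use the same connector class against the same host would still 
share a bin under this scheme, which matches the point above that throttling 
should remain per machine/service rather than per job.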

For (2), we just need to audit the connectors.


> Keep a separate document priority queue per job, and synchronize with any 
> running jobs on job start
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1249
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1249
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 2.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.3
>
>
> Starting a job when there is already a long-running job in MCF takes a 
> very long time, because the documents from the new job don't get processed 
> until the other jobs' backlog at the time the new job was started goes 
> away.
> Effectively, this is because there is only one stream of document priorities, 
> and all jobs tap into that.  But there's no reason why we can't have multiple 
> document priority streams, one per active job, with some redesign work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
