[jira] Commented: (CONNECTORS-146) Logic for dealing with unreachable documents at the end of a non-continuous job run does not handle hopcount and carrydown correctly

Karl Wright (JIRA) Sun, 09 Jan 2011 12:23:10 -0800

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979400#action_12979400
 ]


Karl Wright commented on CONNECTORS-146:
----------------------------------------

Looking at this more deeply, I realized that there is actually a fairly
significant case buried here.  To wit:

**Problem:
Deleting records at cleanup time has a bad side effect: the carrydown information
of a child may well change!  Say (for example) that A->B, B->C, A->D, and D->C.
When A changes so that it no longer links to D, D is orphaned and will be cleaned up.
BUT: the carrydown information for C has changed!  So, C needs reindexing.
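
To make the example concrete, here is a minimal sketch of the A/B/C/D case.
It is not ManifoldCF code: the class CarrydownSketch, its methods, and the
in-memory map standing in for the carrydown table are all hypothetical, and
"carrydown" is reduced to "which parents point at this child".

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the ManifoldCF code base.
class CarrydownSketch {
    /** Invert parent->children links into child->parents (who carries data down to whom). */
    static Map<String, Set<String>> carrydownParents(Map<String, Set<String>> links) {
        Map<String, Set<String>> parents = new TreeMap<String, Set<String>>();
        for (Map.Entry<String, Set<String>> e : links.entrySet()) {
            for (String child : e.getValue()) {
                parents.computeIfAbsent(child, k -> new TreeSet<String>()).add(e.getKey());
            }
        }
        return parents;
    }

    /** Remove a document and its outgoing links; return the children that lost a parent. */
    static Set<String> deleteDocument(Map<String, Set<String>> links, String doc) {
        Set<String> children = links.remove(doc);
        return children == null ? new TreeSet<String>() : new TreeSet<String>(children);
    }

    /** Build the graph from the example: A->B, B->C, A->D, D->C. */
    static Map<String, Set<String>> exampleGraph() {
        Map<String, Set<String>> links = new TreeMap<String, Set<String>>();
        links.put("A", new TreeSet<String>(Arrays.asList("B", "D")));
        links.put("B", new TreeSet<String>(Arrays.asList("C")));
        links.put("D", new TreeSet<String>(Arrays.asList("C")));
        return links;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> links = exampleGraph();
        System.out.println(carrydownParents(links).get("C")); // parents of C before cleanup
        links.get("A").remove("D");                           // A no longer links to D
        Set<String> affected = deleteDocument(links, "D");    // D is orphaned and cleaned up
        System.out.println(carrydownParents(links).get("C")); // parents of C after cleanup
        System.out.println(affected);                         // children needing reindexing
    }
}
```

The set returned by deleteDocument plays the role of "documents whose
carrydown data changed": C is in it even though C itself was never touched.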

**One solution:
Do nothing.  On the next run, the change to C will be detected.  Or will it?
If the connector's seeding method doesn't report C as changed, nothing else
will notice it either.  So this will not work.

**Another solution:
Put the child documents into PENDINGPURGATORY and return the job to the active
state at the end of the SHUTTINGDOWN phase.  The return can be automatic; the
existence of PENDINGPURGATORY records when there are no remaining PURGATORY
records can help the crawler decide.  The documents should go to
PENDINGPURGATORY only if they are in the COMPLETED state; they should not if
they are in the PURGATORY state.
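
The two rules in that paragraph can be sketched as follows.  This is only an
illustration under assumptions: the enum DocState and both method names are
made up for the example and do not correspond to the actual job queue code.

```java
import java.util.Collection;

// Hypothetical illustration only; the names below are not ManifoldCF APIs.
class RequeueSketch {
    enum DocState { COMPLETED, PURGATORY, PENDINGPURGATORY }

    /** A child whose carrydown data changed is requeued only if it was COMPLETED. */
    static DocState onCarrydownChanged(DocState current) {
        return current == DocState.COMPLETED ? DocState.PENDINGPURGATORY : current;
    }

    /**
     * At the end of SHUTTINGDOWN: go back to active if requeued documents exist
     * and no PURGATORY records remain.
     */
    static boolean shouldReturnToActive(Collection<DocState> queueStates) {
        return queueStates.contains(DocState.PENDINGPURGATORY)
            && !queueStates.contains(DocState.PURGATORY);
    }
}
```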

In order to implement this solution, we want to call this JobManager method:

  /** Note deletion as result of document processing by a job thread of a document.
  *@param documentDescriptions are the set of description objects for the documents that were processed.
  *@param hopcountMethod describes how to handle deletions for hopcount purposes.
  *@return the set of documents for which carrydown data was changed by this operation.  These documents are likely
  *  to be requeued as a result of the change.
  */
  public DocumentDescription[] markDocumentDeletedMultiple(Long jobID, String[] legalLinkTypes,
    DocumentDescription[] documentDescriptions, int hopcountMethod)
    throws ManifoldCFException


**Problem:
Need to get legalLinkTypes and hopcountMethod in order to call this method
instead of the method

  public void cleanupIngestedDocumentIdentifiers(DocumentDescription[] identifiers)
    throws ManifoldCFException

... which we call today.

**Solution: I have all the necessary information in DocumentCleanupThread.  I
just need to rework the thread code to correlate that information with each
document so the right call is made.

**Problem:
When, during cleanup stuffing, I detect legal documents for cleanup that are
shared with other jobs, but are not active, what should I do?

**Solution:
Since the right database cleanup involves calling markDocumentDeletedMultiple(),
the documents must still be queued, with a signal flag that tells
DocumentCleanupThread not to actually delete them from the index.  But, is there
a race condition here?  Since we cannot queue the same document for another job
until the processing is complete, there probably isn't.  But we need to add a
special bit to the queue, which signals whether to delete the document from the
search index or not, and also change both the stuffer and the cleanup threads to
do the right thing with that bit.
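
A rough sketch of what that bit could look like on a queued item.  Everything
here is hypothetical (CleanupItem, deleteFromIndex, processBatch, and the Set
standing in for the search index); the point is only that every queued
document gets database cleanup, while index deletion is gated by the flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the actual stuffer/cleanup thread code.
class CleanupQueueSketch {
    /** A queued document plus the bit saying whether to remove it from the search index. */
    static final class CleanupItem {
        final String documentId;
        final boolean deleteFromIndex;

        CleanupItem(String documentId, boolean deleteFromIndex) {
            this.documentId = documentId;
            this.deleteFromIndex = deleteFromIndex;
        }
    }

    /**
     * Cleanup-thread side: remove from the index only when flagged, but return every
     * document so that database cleanup (hopcount, carrydown) still happens for all.
     */
    static List<String> processBatch(List<CleanupItem> batch, Set<String> searchIndex) {
        List<String> forDatabaseCleanup = new ArrayList<String>();
        for (CleanupItem item : batch) {
            if (item.deleteFromIndex) {
                searchIndex.remove(item.documentId);
            }
            forDatabaseCleanup.add(item.documentId);
        }
        return forDatabaseCleanup;
    }
}
```

In this sketch the stuffer would set deleteFromIndex to false for a document
that is shared with another job, so the index entry survives while the job's
own records are still cleaned up.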


> Logic for dealing with unreachable documents at the end of a non-continuous 
> job run does not handle hopcount and carrydown correctly
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-146
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-146
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Framework crawler agent
>            Reporter: Karl Wright
>
> The same logic is used for deleting documents that belong to jobs that are 
> going away, and jobs that are just cleaning up after a crawl.  A shortcut in 
> the logic makes it only appropriate at this time for jobs that are going away 
> entirely.  No hopcount or carrydown cleanup is ever done, for instance.
> A solution may involve having separate stuffer and worker threads for these 
> two circumstances.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
