[ https://issues.apache.org/jira/browse/CONNECTORS-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979400#action_12979400 ]
Karl Wright commented on CONNECTORS-146:
----------------------------------------

Looking at this more deeply, I realized that there is actually a fairly significant case buried here. To wit:

**Problem: Deleting records at cleanup time has a bad side effect: the carrydown information of a child may well change! Say (for example) that A->B, B->C, A->D, and D->C. When A changes so that it no longer ->D, then D is orphaned and will be cleaned up. BUT: the carrydown information for C has changed! So, C needs reindexing.

**One solution: Do nothing. On the next run, the change to C will be detected. Or will it? If the connector's seeding method doesn't detect the change, the change won't be detected at all. So this will not work.

**Another solution: Put the child documents into PENDINGPURGATORY and return the job to the active state at the end of the SHUTTINGDOWN phase. The return can be automatic; the existence of PENDINGPURGATORY records when there are no remaining PURGATORY records can help the crawler decide. The documents should go to PENDINGPURGATORY only if they are in the COMPLETED state; they should not be moved if they are in the PURGATORY state.

In order to implement this solution, we want to call this JobManager method:

  /** Note deletion as result of document processing by a job thread of a document.
  * @param documentDescriptions are the set of description objects for the documents that were processed.
  * @param hopcountMethod describes how to handle deletions for hopcount purposes.
  * @return the set of documents for which carrydown data was changed by this operation.  These documents are likely
  *  to be requeued as a result of the change.
  */
  public DocumentDescription[] markDocumentDeletedMultiple(Long jobID, String[] legalLinkTypes,
    DocumentDescription[] documentDescriptions, int hopcountMethod)
    throws ManifoldCFException

**Problem: Need to get legalLinkTypes and hopcountMethod in order to call this method instead of the method

  public void cleanupIngestedDocumentIdentifiers(DocumentDescription[] identifiers)
    throws ManifoldCFException

... which we call today.

**Solution: I have all the necessary information in DocumentCleanupThread. I just need to rework the thread code to correlate it properly.

**Problem: When, during cleanup stuffing, I detect legal documents for cleanup that are shared with other jobs, but are not active, what should I do?

**Solution: Since the right database cleanup involves calling markDocumentDeletedMultiple(), the documents must still be queued, with a signal flag that tells DocumentCleanupThread not to actually delete them from the index. But is there a race condition here? Since we cannot queue the same document for another job until the processing is complete, there probably isn't. We do need to add a special bit to the queue, which signals whether to delete the document from the search index or not, and also change both the stuffer and the cleanup threads to do the right thing with that bit.

> Logic for dealing with unreachable documents at the end of a non-continuous
> job run does not handle hopcount and carrydown correctly
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-146
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-146
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Framework crawler agent
>            Reporter: Karl Wright
>
> The same logic is used for deleting documents that belong to jobs that are
> going away, and jobs that are just cleaning up after a crawl.
> A shortcut in the logic makes it only appropriate at this time for jobs that
> are going away entirely.  No hopcount or carrydown cleanup is ever done, for
> instance.
> A solution may involve having separate stuffer and worker threads for these
> two circumstances.
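
For illustration only, here is a minimal Java sketch of the carrydown case from the comment (A->B, B->C, A->D, D->C, with D orphaned and cleaned up). The class CarrydownSketch, its methods, and the State enum are hypothetical and are not ManifoldCF code; only the state names COMPLETED, PURGATORY, and PENDINGPURGATORY come from the comment. The sketch assumes that deleting a document changes the carrydown data of each of its direct children.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model; NOT the ManifoldCF JobManager.
class CarrydownSketch {
    // State names taken from the comment; everything else is assumed.
    enum State { COMPLETED, PURGATORY, PENDINGPURGATORY }

    // parent document -> child documents it passes carrydown data to
    private final Map<String, Set<String>> children = new HashMap<>();
    private final Map<String, State> states = new HashMap<>();

    public void addLink(String parent, String child) {
        children.computeIfAbsent(parent, k -> new LinkedHashSet<>()).add(child);
    }

    public void removeLink(String parent, String child) {
        Set<String> kids = children.get(parent);
        if (kids != null) {
            kids.remove(child);
        }
    }

    public void setState(String doc, State state) { states.put(doc, state); }

    public State getState(String doc) { return states.get(doc); }

    // Delete a document and return the children whose carrydown data changed.
    // COMPLETED children move to PENDINGPURGATORY; PURGATORY children are left alone.
    public List<String> deleteDocument(String doc) {
        states.remove(doc);
        Set<String> kids = children.remove(doc);
        List<String> changed = new ArrayList<>();
        if (kids == null) {
            return changed;
        }
        for (String child : kids) {
            changed.add(child);
            if (states.get(child) == State.COMPLETED) {
                states.put(child, State.PENDINGPURGATORY);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        CarrydownSketch job = new CarrydownSketch();
        job.addLink("A", "B");
        job.addLink("B", "C");
        job.addLink("A", "D");
        job.addLink("D", "C");
        for (String doc : new String[] {"A", "B", "C", "D"}) {
            job.setState(doc, State.COMPLETED);
        }
        job.removeLink("A", "D");                    // A no longer links to D
        List<String> changed = job.deleteDocument("D"); // D is orphaned and cleaned up
        System.out.println("carrydown changed for: " + changed);
        System.out.println("state of C: " + job.getState("C"));
    }
}
```

Running main reports C as the only document whose carrydown data changed and leaves it in PENDINGPURGATORY, which is the signal the comment proposes for returning the job to the active state.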