Hi,

2.22 makes no changes to the way document deletions are processed compared
with probably the previous 10 versions of ManifoldCF.

The most likely cause is that the output connection for the job you are
cleaning up is down.  When that happens, the documents are queued for
deletion, but the delete worker threads cannot make any progress.

You can probably confirm this by looking at the "Simple Reports" for the job
in question to see what it is doing and why the deletions are not succeeding.
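
As a complementary check, a query along these lines would show how the job's
remaining documents are distributed by status. This is only a sketch: it
reuses the `jobqueue` table, the `jobid` and `status` columns, and the job ID
from the query in the message quoted below, and assumes nothing else about
the schema.

```sql
-- Count the documents still queued for this job, grouped by status code.
-- 'E' = STATUS_ELIGIBLEFORDELETE, 'D' = STATUS_BEINGDELETED (per the quoted message).
select status, count(*) as documents
from jobqueue
where jobid = '1658215015582'
group by status;
```

If the `D` count stays the same across repeated runs, the delete worker
threads are not making progress, which would be consistent with the output
connection being down.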

Karl


On Sun, Jan 29, 2023 at 8:18 AM Artem Abeleshev <
artem.abeles...@rondhuit.com> wrote:

> Hi, everyone!
>
> Here is another problem that I run into sometimes. We are using ManifoldCF
> 2.22.1 with multiple nodes in production. The MCF job pipeline is created
> via API calls from our service. We create jobs, repositories and output
> repositories. The crawler extracts documents and then they are pushed to
> Solr. The pipeline works OK.
>
> The problem is with deleting a job. Sometimes the job gets stuck in the
> `Cleaning up` status (in the DB it has status `e`, which corresponds to
> `STATUS_DELETING`). This time I used the MCF Web Admin to delete the job
> (I pressed the delete button on the job list page).
>
> I have checked the sources and debugged it a bit. The method
> `deleteJobsReadyForDelete()`
> (`org.apache.manifoldcf.crawler.jobs.JobManager.deleteJobsReadyForDelete()`)
> works OK. It is unable to delete the job because it still finds some
> documents in the document queue table. The following SQL is executed
> within this method:
>
> ```sql
> select id from jobqueue where jobid = '1658215015582' and (status = 'E' or
> status = 'D') limit 1;
> ```
>
> where the `E` status stands for `STATUS_ELIGIBLEFORDELETE` and the `D`
> status stands for `STATUS_BEINGDELETED`. If at least one such document is
> found in the queue, the method does nothing. At the moment I have a lot of
> documents in `jobqueue` with these statuses (actually, all of them have
> the `D` status).
>
> I see that the `Documents delete stuffer thread` is running, and it sets
> the status `STATUS_BEINGDELETED` on the documents via the
> `getNextDeletableDocuments()` method
> (`org.apache.manifoldcf.crawler.jobs.JobManager.getNextDeletableDocuments(String,
> int, long)`). But I can't find any logic that actually deletes the
> documents. I've searched through the sources, but the status
> `STATUS_BEINGDELETED` is mentioned mostly in `NOT EXISTS ...` queries.
> Searching backwards from `JobQueue`
> (`org.apache.manifoldcf.crawler.jobs.JobQueue`) also didn't give me any
> results. I would appreciate it if someone could point me to where to look,
> so I can debug and check what conditions are preventing the documents from
> being removed.
>
> Thank you!
>
> With respect,
> Artem Abeleshev
>
