Karl, good day!

Thank you for the hint, it was very useful! You were right: the actual problem was the connection. But I didn't expect the effect to be so dramatic. Here is what I found while debugging:

First I found the code that is actually responsible for deleting the documents. It is called by the `DocumentDeleteThread` (`org.apache.manifoldcf.crawler.system.DocumentDeleteThread`). Then I checked how many `DocumentDeleteThread` threads are supposed to be started. I haven't overridden the value, so I got the default of 10 threads. Then I grabbed a thread dump and checked those threads. I found two strange things:

1. Not all of the threads were alive. Some of them had terminated.
2. Some of the live threads had a huge number of supplementary zk threads like `Worker thread 'x'-EventThread` and `Worker thread 'n'-EventThread(...)`. Even the threads that had already terminated left their supplementary threads behind (since they are daemon threads). As a result I had from 1000 to 2000 threads in total.

I started debugging the live threads and ended up in the `deletePost` method of `HttpPoster` (`org.apache.manifoldcf.agents.output.solr.HttpPoster.deletePost(String, IOutputRemoveActivity)`). Here I was always getting an exception:

```java
org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: http://10.78.11.71:8983/solr/jaweb
org.apache.http.conn.ConnectTimeoutException: Connect to 10.78.11.71:8983 [/ 10.78.11.71] failed: connect timed out
```

The exception was thrown because Solr was unavailable (i.e. shut down), so no surprise here. But what followed was a real surprise for me. The exception I got is of type `IOException`. Inside `HttpPoster` that exception is eventually handled by the method `handleIOException` (`org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(IOException, String)`):

```java
protected static void handleIOException(IOException e, String context)
  throws ManifoldCFException, ServiceInterruption {
  if ((e instanceof InterruptedIOException) && (!(e instanceof java.net.SocketTimeoutException)))
    throw new ManifoldCFException(e.getMessage(), ManifoldCFException.INTERRUPTED);
  ...
}
```

As we can see, the exception is wrapped in a `ManifoldCFException` and assigned the `INTERRUPTED` error code. This exception then bubbles up until it reaches the main loop of the `DocumentDeleteThread`. Here is the full stack I extracted during debugging (unfortunately not a single exception is logged on the way):

```java
org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(HttpPoster.java:514),
org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrServerException(HttpPoster.java:427),
org.apache.manifoldcf.agents.output.solr.HttpPoster.deletePost(HttpPoster.java:817),
org.apache.manifoldcf.agents.output.solr.SolrConnector.removeDocument(SolrConnector.java:594),
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.removeDocument(IncrementalIngester.java:2296),
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentDeleteMultiple(IncrementalIngester.java:1037),
org.apache.manifoldcf.crawler.system.DocumentDeleteThread.run(DocumentDeleteThread.java:134)
```

Inside the main loop of the `DocumentDeleteThread` that exception is handled like this:

```java
public void run() {
  try {
    ...
    // Loop
    while (true) {
      // Do another try/catch around everything in the loop
      try {
        ...
      } catch (ManifoldCFException e) {
        if (e.getErrorCode() == ManifoldCFException.INTERRUPTED)
          break;
        ...
      }
      ...
    }
  } catch (Throwable e) {
    ...
  }
}
```

It just breaks out of the loop, so the thread terminates normally! Within quite a short time I always end up with no `DocumentDeleteThread`s at all, and the framework transitions to an inconsistent state.

In the end, I brought Solr back online and managed to finish the deletion successfully. But I think this case should be handled in some way.

With respect,
Abeleshev Artem

On Sun, Jan 29, 2023 at 10:36 PM Karl Wright <daddy...@gmail.com> wrote:

> Hi,
>
> 2.22 makes no changes to the way document deletions are processed over
> probably 10 previous versions of ManifoldCF.
>
> What likely is the case is that the connection to the output for the job
> you are cleaning up is down. When that happens, the documents are queued
> but the delete worker threads cannot make any progress.
>
> You can see this maybe by looking at the "Simple Reports" for the job in
> question and see what it is doing and why the deletions are not succeeding.
>
> Karl
>
>
> On Sun, Jan 29, 2023 at 8:18 AM Artem Abeleshev <artem.abeles...@rondhuit.com> wrote:
>
>> Hi, everyone!
>>
>> Another problem that I run into sometimes. We are using ManifoldCF 2.22.1
>> with multiple nodes in our production. The creation of the MCF job
>> pipeline is handled via API calls from our service. We create jobs,
>> repositories and output repositories. The crawler extracts documents and
>> then they are pushed to Solr. The pipeline works OK.
>>
>> The problem is with deleting the job. Sometimes the job gets stuck
>> with a `Cleaning up` status (in the DB it has status `e`, which
>> corresponds to `STATUS_DELETING`). This time I used the MCF Web Admin to
>> delete the job (pressed the delete button on the job list page).
>>
>> I have checked the sources and debugged a bit. The method
>> `deleteJobsReadyForDelete()`
>> (`org.apache.manifoldcf.crawler.jobs.JobManager.deleteJobsReadyForDelete()`)
>> works OK. It is unable to delete the job because it still finds some
>> documents in the document queue table. The following SQL is executed
>> within this method:
>>
>> ```sql
>> select id from jobqueue where jobid = '1658215015582' and (status = 'E'
>> or status = 'D') limit 1;
>> ```
>>
>> where the `E` status stands for `STATUS_ELIGIBLEFORDELETE` and the `D`
>> status stands for `STATUS_BEINGDELETED`. If at least one such document is
>> found in the queue, it does nothing. At the moment I have a lot of
>> documents residing in the `jobqueue` table with the indicated statuses
>> (actually all of them have the `D` status).
>>
>> I see that the `Documents delete stuffer thread` is running, and it sets
>> the status `STATUS_BEINGDELETED` on the documents via the
>> `getNextDeletableDocuments()` method
>> (`org.apache.manifoldcf.crawler.jobs.JobManager.getNextDeletableDocuments(String,
>> int, long)`). But I can't find any logic that actually deletes the
>> documents. I've searched through the sources, but the status
>> `STATUS_BEINGDELETED` is mentioned mostly in `NOT EXISTS ...` queries.
>> Searching in reverse order from `JobQueue`
>> (`org.apache.manifoldcf.crawler.jobs.JobQueue`) also doesn't give me any
>> results. I would appreciate it if someone could point me where to look,
>> so I can debug and check what conditions are preventing the documents
>> from being removed.
>>
>> Thank you!
>>
>> With respect,
>> Artem Abeleshev
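
As a footnote to the analysis at the top of this message: the reason a plain connect timeout ends up with the `INTERRUPTED` code seems to be that, in HttpClient 4.x, `org.apache.http.conn.ConnectTimeoutException` extends `java.io.InterruptedIOException`, so it passes the check in `handleIOException`. Below is a minimal, dependency-free sketch of that classification. `StandInConnectTimeoutException`, `currentCheck` and `narrowerCheck` are hypothetical names made up for this illustration; `narrowerCheck` is only one possible direction for a fix, not ManifoldCF code and not tested against it.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

// Hypothetical stand-in for org.apache.http.conn.ConnectTimeoutException,
// which in HttpClient 4.x extends InterruptedIOException.
// Defined locally so the sketch compiles without any dependencies.
class StandInConnectTimeoutException extends InterruptedIOException {
  StandInConnectTimeoutException(String message) {
    super(message);
  }
}

class InterruptClassificationSketch {

  // Same condition as the check quoted from HttpPoster.handleIOException.
  static boolean currentCheck(IOException e) {
    return (e instanceof InterruptedIOException)
        && !(e instanceof SocketTimeoutException);
  }

  // Hypothetical narrower condition (an assumption, not ManifoldCF code):
  // additionally exclude connect timeouts, so that an unreachable server
  // is not reported as a thread interruption.
  static boolean narrowerCheck(IOException e) {
    return currentCheck(e) && !(e instanceof StandInConnectTimeoutException);
  }

  public static void main(String[] args) {
    IOException connectTimeout =
        new StandInConnectTimeoutException("connect timed out");
    // The current condition classifies the connect timeout as an interruption.
    System.out.println("current check:  " + currentCheck(connectTimeout));
    // The narrower condition does not.
    System.out.println("narrower check: " + narrowerCheck(connectTimeout));
  }
}
```

With a condition like the narrower one, the connect timeout would fall through to whatever follows the `INTERRUPTED` branch in `handleIOException` instead of ending the delete thread. Whether that remaining handling is appropriate for this case is something I have not verified.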