The shutdown procedure for ManifoldCF involves sending interruptions (or
socket interruptions) to all worker threads.  These interruptions put the
threads into the "terminated" state, one by one.  So you should only see
this if you shut down the agents process, or try to.  The handling for this
is correct, although sometimes embedded libraries do not handle thread
shutdown requests properly.
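
For reference, here is a minimal sketch of the pattern I mean (not the
actual ManifoldCF worker code; the class and method names are illustrative
only):

```java
// The agents process interrupts each worker thread; the worker treats the
// interrupt as its signal to exit, which leaves it in the "terminated" state.
public class Worker extends Thread {
  @Override
  public void run() {
    while (!isInterrupted()) {
      try {
        doOneUnitOfWork(); // stand-in for real crawler work
      } catch (InterruptedException e) {
        // Interrupted while blocked: fall out of the loop and let the thread die.
        break;
      }
    }
  }

  private void doOneUnitOfWork() throws InterruptedException {
    Thread.sleep(100L); // real code would block on queues, sockets, etc.
  }
}
```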

Anyhow, the cause of the problem is, as stated, that the output connection
cannot talk to the service.

Karl


On Fri, Feb 3, 2023 at 12:54 AM Artem Abeleshev <
artem.abeles...@rondhuit.com> wrote:

> Karl, good day!
>
> Thank you for the hint! It was very useful! Actually, you were right: the
> problem was indeed with the connection. But I didn't expect it to be so
> dramatic. Here is what I found with some debugging:
>
> First I found the actual code responsible for the deletion of the
> documents. It is called by the `DocumentDeleteThread`
> (`org.apache.manifoldcf.crawler.system.DocumentDeleteThread`). Then I
> checked how many `DocumentDeleteThread` threads were supposed to be started.
> I hadn't overridden the value, so I got the default of 10 threads. Then I
> grabbed a thread dump and checked those threads. I found two strange things:
>
> 1. Not all threads were alive. Some of them had terminated.
> 2. Some live threads had a huge number of supplementary ZooKeeper threads like
> `Worker thread 'x'-EventThread` and `Worker thread 'n'-EventThread(...)`.
> Even the threads that had already terminated left behind their
> supplementary threads (since they are daemon threads). As a result I had
> from 1000 to 2000 threads in total (see the counting sketch below).
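>
> To count them I mostly used the thread dump, but the same numbers can be
> obtained programmatically. A small diagnostic sketch of the check (my own
> illustration, not ManifoldCF code; it would need to run inside the agents
> JVM to see its threads, and the thread-name prefix is an assumption based
> on how the threads appear in my dump):
>
> ```java
> import java.util.Map;
>
> public class DeleteThreadCount {
>   public static void main(String[] args) {
>     // Snapshot of all live threads in this JVM
>     Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
>     long alive = all.keySet().stream()
>       .filter(t -> t.getName().startsWith("Document delete thread"))
>       .count();
>     System.out.println("Live document delete threads: " + alive);
>     System.out.println("Total live threads: " + all.size());
>   }
> }
> ```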
>
> I started debugging the live threads and ended up in the `deletePost`
> method of `HttpPoster`
> (`org.apache.manifoldcf.agents.output.solr.HttpPoster.deletePost(String,
> IOutputRemoveActivity)`). Here I was always getting an exception:
>
> ```
> org.apache.solr.client.solrj.SolrServerException: IOException occurred
> when talking to server at: http://10.78.11.71:8983/solr/jaweb
> org.apache.http.conn.ConnectTimeoutException: Connect to 10.78.11.71:8983
> [/10.78.11.71] failed: connect timed out
> ```
>
> The exception was due to Solr being unavailable (i.e. shut down), so no
> surprise there. But what followed was a true surprise for me. The exception
> I got is of type `IOException`. Inside the `HttpPoster` that exception
> is eventually handled by the method `handleIOException`
> (`org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(IOException,
> String)`):
>
> ```java
>   protected static void handleIOException(IOException e, String context)
>     throws ManifoldCFException, ServiceInterruption
>   {
>     if ((e instanceof InterruptedIOException) && (!(e instanceof
> java.net.SocketTimeoutException)))
>       throw new ManifoldCFException(e.getMessage(),
> ManifoldCFException.INTERRUPTED);
>     ...
>   }
> ```
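>
> The key detail here (which took me a while to spot) is that
> `org.apache.http.conn.ConnectTimeoutException` extends
> `java.io.InterruptedIOException` but is not a `java.net.SocketTimeoutException`,
> so my connect timeout satisfies the first condition above. A quick standalone
> check illustrates this (assuming HttpClient 4.x on the classpath; the class
> name `TimeoutCheck` is just mine for illustration):
>
> ```java
> import java.io.IOException;
> import java.io.InterruptedIOException;
> import java.net.SocketTimeoutException;
> import org.apache.http.conn.ConnectTimeoutException;
>
> public class TimeoutCheck {
>   public static void main(String[] args) {
>     // The exception from deletePost, as handleIOException sees it:
>     IOException e = new ConnectTimeoutException("connect timed out");
>     System.out.println(e instanceof InterruptedIOException); // prints: true
>     System.out.println(e instanceof SocketTimeoutException); // prints: false
>   }
> }
> ```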
>
> As we can see, the exception is wrapped in a `ManifoldCFException` and
> assigned the `INTERRUPTED` error code. Then this exception bubbles up
> until it ends up in the main loop of the `DocumentDeleteThread`. Here is
> the full stack I extracted while debugging (unfortunately, not a single
> exception is logged along the way):
>
> ```
> org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(HttpPoster.java:514),
> org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrServerException(HttpPoster.java:427),
> org.apache.manifoldcf.agents.output.solr.HttpPoster.deletePost(HttpPoster.java:817),
> org.apache.manifoldcf.agents.output.solr.SolrConnector.removeDocument(SolrConnector.java:594),
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.removeDocument(IncrementalIngester.java:2296),
> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentDeleteMultiple(IncrementalIngester.java:1037),
> org.apache.manifoldcf.crawler.system.DocumentDeleteThread.run(DocumentDeleteThread.java:134)
> ```
>
> Inside the main loop of the `DocumentDeleteThread` that exception is
> handled like this:
>
> ```java
>   public void run()
>   {
>     try
>     {
>       ...
>       // Loop
>       while (true)
>       {
>         // Do another try/catch around everything in the loop
>         try
>         {
>           ...
>         }
>         catch (ManifoldCFException e)
>         {
>           if (e.getErrorCode() == ManifoldCFException.INTERRUPTED)
>             break;
>           ...
>         }
>         ...
>       }
>     }
>     catch (Throwable e)
>     {
>       ...
>     }
>   }
> ```
>
> It just breaks the loop, making the thread terminate normally! In quite a
> short time I always end up with no `DocumentDeleteThread`s at all, and the
> framework transitions to an inconsistent state.
>
> In the end, I brought Solr back online and managed to finish the deletion
> successfully. But I think this case should be handled in some way.
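>
> For example, one possible way (just a sketch of the idea from my side, not
> actual ManifoldCF code; the retry parameters are made up) would be to
> recognize the connect timeout before the generic `InterruptedIOException`
> test and turn it into a retryable `ServiceInterruption` instead of an
> `INTERRUPTED` `ManifoldCFException`:
>
> ```java
>   protected static void handleIOException(IOException e, String context)
>     throws ManifoldCFException, ServiceInterruption
>   {
>     // Hypothetical: a connect timeout is a service problem, not a thread
>     // interruption, so schedule a retry instead of killing the worker thread.
>     if (e instanceof org.apache.http.conn.ConnectTimeoutException)
>     {
>       long currentTime = System.currentTimeMillis();
>       throw new ServiceInterruption("Solr connect timeout during "+context+": "+e.getMessage(),
>         e, currentTime + 60000L, -1L, 3, true);
>     }
>     if ((e instanceof InterruptedIOException) && (!(e instanceof java.net.SocketTimeoutException)))
>       throw new ManifoldCFException(e.getMessage(), ManifoldCFException.INTERRUPTED);
>     ...
>   }
> ```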
>
> With respect,
> Abeleshev Artem
>
> On Sun, Jan 29, 2023 at 10:36 PM Karl Wright <daddy...@gmail.com> wrote:
>
>> Hi,
>>
>> 2.22 makes no changes to the way document deletions are processed; it has
>> worked the same way for probably 10 previous versions of ManifoldCF.
>>
>> What is likely the case is that the connection to the output for the job
>> you are cleaning up is down.  When that happens, the documents are queued
>> but the delete worker threads cannot make any progress.
>>
>> You can check this by looking at the "Simple Reports" for the job in
>> question to see what it is doing and why the deletions are not succeeding.
>>
>> Karl
>>
>>
>> On Sun, Jan 29, 2023 at 8:18 AM Artem Abeleshev <
>> artem.abeles...@rondhuit.com> wrote:
>>
>>> Hi, everyone!
>>>
>>> Here is another problem that I hit sometimes. We are using ManifoldCF 2.22.1
>>> with multiple nodes in our production. The creation of the MCF job pipeline
>>> is handled via API calls from our service. We create jobs, repository
>>> connections, and output connections. The crawler extracts documents and they
>>> are then pushed to Solr. The pipeline works OK.
>>>
>>> The problem is with deleting a job. Sometimes the job gets stuck in the
>>> `Cleaning up` status (in the DB it has status `e`, which corresponds to
>>> `STATUS_DELETING`). This time I used the MCF Web Admin to delete
>>> the job (pressed the delete button on the job list page).
>>>
>>> I have checked the sources and debugged it a bit. The method
>>> `deleteJobsReadyForDelete()`
>>> (`org.apache.manifoldcf.crawler.jobs.JobManager.deleteJobsReadyForDelete()`)
>>> works OK. It is unable to delete the job because it still finds some
>>> documents in the document queue table. The following SQL is executed
>>> within this method:
>>>
>>> ```sql
>>> select id from jobqueue where jobid = '1658215015582' and (status = 'E'
>>> or status = 'D') limit 1;
>>> ```
>>>
>>> where the `E` status stands for `STATUS_ELIGIBLEFORDELETE` and the `D`
>>> status stands for `STATUS_BEINGDELETED`. If at least one such document is
>>> found in the queue, the method does nothing. At the time I had a lot of
>>> documents residing in the `jobqueue` table with the indicated statuses
>>> (actually all of them had the `D` status).
>>>
>>> I see that the `Documents delete stuffer thread` is running, and it sets
>>> the `STATUS_BEINGDELETED` status on the documents via the
>>> `getNextDeletableDocuments()` method
>>> (`org.apache.manifoldcf.crawler.jobs.JobManager.getNextDeletableDocuments(String,
>>> int, long)`). But I can't find any logic that actually deletes the
>>> documents. I've searched through the sources, but the status
>>> `STATUS_BEINGDELETED` is mentioned mostly in `NOT EXISTS ...` queries.
>>> Searching in reverse order from `JobQueue`
>>> (`org.apache.manifoldcf.crawler.jobs.JobQueue`) also didn't give me any
>>> results. I would appreciate it if someone could point out where to look,
>>> so I can debug and check what conditions are preventing the documents
>>> from being removed.
>>>
>>> Thank you!
>>>
>>> With respect,
>>> Artem Abeleshev
>>>
>>
