Karl, good day!

Thank you for the hint! It was very useful! You were right: the actual
problem was indeed the connection. But I didn't expect it to be so
dramatic. Here is what I found while debugging:

First I found the code responsible for deleting the documents. It is
called by the `DocumentDeleteThread`
(`org.apache.manifoldcf.crawler.system.DocumentDeleteThread`). Then I
checked how many `DocumentDeleteThread` threads are supposed to be started.
I haven't overridden the value, so I got the default of 10 threads. Then I
grabbed a thread dump and inspected those threads. I found two strange things:

1. Not all threads were alive. Some of them had terminated.
2. Some live threads had a huge number of supplementary ZooKeeper threads like
`Worker thread 'x'-EventThread` and `Worker thread 'n'-EventThread(...)`.
Even the threads that had already terminated left their supplementary
threads behind (since they are daemon threads). As a result I had between
1000 and 2000 threads in total.
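
For anyone who wants to reproduce the pattern, here is a minimal
self-contained sketch (the thread names are made up to mimic the ones from
my dump) showing how a daemon helper thread survives its parent, which is
exactly the leak above:

```java
import java.util.Map;

public class ThreadAudit {
    public static void main(String[] args) throws Exception {
        // Spawn a worker that starts a daemon helper, mimicking the
        // leaked "-EventThread" helpers observed in the thread dump.
        Thread parent = new Thread(() -> {
            Thread helper = new Thread(() -> {
                try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
            }, "Worker thread '1'-EventThread");
            helper.setDaemon(true); // daemon: it will not keep the JVM alive,
            helper.start();         // but it lingers after its parent dies
        }, "Worker thread '1'");
        parent.start();
        parent.join(); // the parent terminates normally here...

        // ...yet its helper is still present in the live-thread set.
        long leaked = Thread.getAllStackTraces().keySet().stream()
            .filter(t -> t.getName().endsWith("-EventThread"))
            .count();
        System.out.println("EventThread count after parent died: " + leaked);
        // prints: EventThread count after parent died: 1
    }
}
```

Multiply this by 10 delete threads, each restarted or re-created with its
own set of helpers, and the 1000-2000 total becomes plausible.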

I started debugging the live threads and ended up in the `deletePost`
method of `HttpPoster`
(`org.apache.manifoldcf.agents.output.solr.HttpPoster.deletePost(String,
IOutputRemoveActivity)`). There I was always getting an exception:

```java
org.apache.solr.client.solrj.SolrServerException: IOException occurred when
talking to server at: http://10.78.11.71:8983/solr/jaweb
org.apache.http.conn.ConnectTimeoutException: Connect to 10.78.11.71:8983 [/
10.78.11.71] failed: connect timed out
```

The exception occurred because Solr was unavailable (i.e. shut down), so no
surprise there. But what followed was a true surprise for me. The exception
I got is of type `IOException`. Inside the `HttpPoster` that exception is
eventually handled by the method `handleIOException`
(`org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(IOException,
String)`):

```java
  protected static void handleIOException(IOException e, String context)
    throws ManifoldCFException, ServiceInterruption
  {
    if ((e instanceof InterruptedIOException) && (!(e instanceof
java.net.SocketTimeoutException)))
      throw new ManifoldCFException(e.getMessage(),
ManifoldCFException.INTERRUPTED);
    ...
  }
```
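
The key quirk here is the HttpClient exception hierarchy: as far as I can
tell, `org.apache.http.conn.ConnectTimeoutException` (HttpClient 4.x)
extends `java.io.InterruptedIOException` but not
`java.net.SocketTimeoutException`, so a plain connect timeout satisfies the
first branch. The stand-in class below mirrors that hierarchy (it is not
the real HttpClient class, just a local stub for illustration) to show why
the condition matches:

```java
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

public class ExceptionHierarchyDemo {
    // Stand-in mirroring org.apache.http.conn.ConnectTimeoutException,
    // which in HttpClient 4.x extends java.io.InterruptedIOException.
    static class ConnectTimeoutException extends InterruptedIOException {}

    public static void main(String[] args) {
        Exception e = new ConnectTimeoutException();
        // The same test handleIOException performs:
        boolean treatedAsInterrupt =
            (e instanceof InterruptedIOException) && !(e instanceof SocketTimeoutException);
        // A connect timeout hits the "interrupted" branch,
        // even though no thread was actually interrupted.
        System.out.println("treated as INTERRUPTED: " + treatedAsInterrupt);
        // prints: treated as INTERRUPTED: true
    }
}
```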

As we can see, the exception is wrapped in a `ManifoldCFException` and
assigned the `INTERRUPTED` error code. This exception then bubbles up until
it ends up in the main loop of the `DocumentDeleteThread`. Here is the full
stack I extracted while debugging (unfortunately, not a single exception is
logged along the way):

```java
org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(HttpPoster.java:514),
org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrServerException(HttpPoster.java:427),
org.apache.manifoldcf.agents.output.solr.HttpPoster.deletePost(HttpPoster.java:817),
org.apache.manifoldcf.agents.output.solr.SolrConnector.removeDocument(SolrConnector.java:594),
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.removeDocument(IncrementalIngester.java:2296),
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentDeleteMultiple(IncrementalIngester.java:1037),
org.apache.manifoldcf.crawler.system.DocumentDeleteThread.run(DocumentDeleteThread.java:134)
```

Inside the main loop of the `DocumentDeleteThread` that exception is
handled like this:

```java
public void run()
  {
    try
    {
      ...
      // Loop
      while (true)
      {
        // Do another try/catch around everything in the loop
        try
        {
          ...
        }
        catch (ManifoldCFException e)
        {
          if (e.getErrorCode() == ManifoldCFException.INTERRUPTED)
            break;
            ...
        }
          ...
      }
    }
    catch (Throwable e)
    {
      ...
    }
  }
```

It just breaks the loop, making the thread terminate normally! Within a
fairly short time I always end up with no `DocumentDeleteThread`s at all,
and the framework transitions to an inconsistent state.

In the end, I brought Solr back online and managed to finish the deletion
successfully. But I think this case should be handled in some way.
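
One possible direction, just a sketch: before breaking out of the loop, the
thread could verify that it really was interrupted, rather than trusting
the `INTERRUPTED` error code alone. The `ManifoldCFException` below is a
hypothetical stand-in (the real class lives in ManifoldCF core, and the
numeric value of `INTERRUPTED` here is made up), only the `shouldExitLoop`
check is the point:

```java
public class DeleteLoopSketch {
    // Hypothetical stand-in for ManifoldCFException, for illustration only.
    static class ManifoldCFException extends Exception {
        static final int INTERRUPTED = 2; // assumed value, not the real constant
        final int errorCode;
        ManifoldCFException(String msg, int code) { super(msg); errorCode = code; }
        int getErrorCode() { return errorCode; }
    }

    // Sketch: only leave the loop when the thread really was interrupted;
    // otherwise the failure may be a transient connection problem and the
    // thread could pause and retry instead of dying silently.
    static boolean shouldExitLoop(ManifoldCFException e) {
        return e.getErrorCode() == ManifoldCFException.INTERRUPTED
            && Thread.currentThread().isInterrupted();
    }

    public static void main(String[] args) {
        // Simulate the wrapped connect timeout from handleIOException:
        ManifoldCFException e =
            new ManifoldCFException("connect timed out", ManifoldCFException.INTERRUPTED);
        // No real interrupt happened, so the thread should keep running.
        System.out.println("exit loop: " + shouldExitLoop(e));
        // prints: exit loop: false
    }
}
```

Whether checking the interrupt flag is the right fix for ManifoldCF I
cannot say; at minimum, logging the swallowed exception before breaking
would have made this much easier to diagnose.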

With respect,
Abeleshev Artem

On Sun, Jan 29, 2023 at 10:36 PM Karl Wright <daddy...@gmail.com> wrote:

> Hi,
>
> 2.22 makes no changes to the way document deletions are processed over
> probably 10 previous versions of ManifoldCF.
>
> What likely is the case is that the connection to the output for the job
> you are cleaning up is down.  When that happens, the documents are queued
> but the delete worker threads cannot make any progress.
>
> You can see this maybe by looking at the "Simple Reports" for the job in
> question and see what it is doing and why the deletions are not succeeding.
>
> Karl
>
>
> On Sun, Jan 29, 2023 at 8:18 AM Artem Abeleshev <
> artem.abeles...@rondhuit.com> wrote:
>
>> Hi, everyone!
>>
>> Another problem that I encounter sometimes. We are using ManifoldCF 2.22.1 with
>> multiple nodes in our production. The creation of the MCF job pipeline is
>> handled via API calls from our service. We create jobs, repositories, and
>> output repositories. The crawler extracts documents, and then they are
>> pushed to Solr. The pipeline works OK.
>>
>> The problem is with deleting a job. Sometimes the job gets stuck with a
>> `Cleaning up` status (in the DB it has status `e`, which corresponds to
>> `STATUS_DELETING`). This time I used the MCF Web Admin to delete the job
>> (pressed the delete button on the job list page).
>>
>> I have checked the sources and debugged a bit. The method
>> `deleteJobsReadyForDelete()`
>> (`org.apache.manifoldcf.crawler.jobs.JobManager.deleteJobsReadyForDelete()`)
>> works OK. It is unable to delete the job because it still finds some
>> documents in the document queue table. The following SQL is executed
>> within this method:
>>
>> ```sql
>> select id from jobqueue where jobid = '1658215015582' and (status = 'E'
>> or status = 'D') limit 1;
>> ```
>>
>> where status `E` stands for `STATUS_ELIGIBLEFORDELETE` and status `D`
>> stands for `STATUS_BEINGDELETED`. If at least one such document is found
>> in the queue, it does nothing. At the time I had a lot of documents
>> residing in the `jobqueue` with the indicated statuses (actually, all of
>> them had the `D` status).
>>
>> I see that the `Documents delete stuffer thread` is running, and it sets
>> status `STATUS_BEINGDELETED` on the documents via the
>> `getNextDeletableDocuments()` method
>> (`org.apache.manifoldcf.crawler.jobs.JobManager.getNextDeletableDocuments(String,
>> int, long)`). But I can't find any logic that actually deletes the
>> documents. I've searched through the sources, but status
>> `STATUS_BEINGDELETED` is mentioned mostly in `NOT EXISTS ...` queries.
>> Searching in reverse order from `JobQueue`
>> (`org.apache.manifoldcf.crawler.jobs.JobQueue`) also didn't give me any
>> result. I would appreciate it if someone could point out where to look, so
>> I can debug and check which conditions are preventing the documents from
>> being removed.
>>
>> Thank you!
>>
>> With respect,
>> Artem Abeleshev
>>
>
