Re: Documents blocked sometimes without errors

Karl Wright Mon, 04 Jun 2018 10:55:03 -0700

I attached a patch to the ticket that is a tentative fix.  Please let me
know if you still see this problem after applying it.  Thanks!


Karl


On Mon, Jun 4, 2018 at 12:56 PM Karl Wright <daddy...@gmail.com> wrote:

> CONNECTORS-1507 created.
> Karl
>
>
> On Mon, Jun 4, 2018 at 12:51 PM Karl Wright <daddy...@gmail.com> wrote:
>
>> I think I found the issue.
>> Basically, when the agents process is restarted, it doesn't reprioritize
>> the documents that were active when the service was brought down, but it
>> should because when the documents became active they lost their document
>> priority.  This should be trivial to fix.
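
A minimal sketch of the failure mode described above, not the actual ManifoldCF code or the attached patch. The class, enum, and method names are hypothetical, and the priority assigned on restart is a placeholder counter; only the idea that an active document loses its priority and must get a new one on restart comes from this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the restart problem; names are illustrative only.
class RestartReprioritizeSketch {
  // "No priority" marker, using the value quoted elsewhere in this thread.
  static final double NULL_DOC_PRIORITY = 1e9 + 1.0;

  enum State { PENDING, ACTIVE }

  static final class Doc {
    final String id;
    State state;
    double priority;

    Doc(String id, State state, double priority) {
      this.id = id;
      this.state = state;
      this.priority = priority;
    }
  }

  // Becoming active clears the document priority.
  static void activate(Doc d) {
    d.state = State.ACTIVE;
    d.priority = NULL_DOC_PRIORITY;
  }

  // On restart, formerly active documents go back to pending.
  // reprioritize == false models the reported bug; true models the fix.
  static void resetAfterRestart(List<Doc> docs, boolean reprioritize) {
    double next = 0.0; // placeholder priority source (assumption)
    for (Doc d : docs) {
      if (d.state == State.ACTIVE) {
        d.state = State.PENDING;
        if (reprioritize) {
          d.priority = next++;
        }
      }
    }
  }

  // A pending document can only be queued if it has a real priority.
  static boolean queueable(Doc d) {
    return d.state == State.PENDING && d.priority < NULL_DOC_PRIORITY;
  }

  public static void main(String[] args) {
    List<Doc> docs = new ArrayList<>();
    Doc d = new Doc("doc-1", State.PENDING, 0.0);
    docs.add(d);

    activate(d);
    resetAfterRestart(docs, false);
    System.out.println("without reprioritization, queueable: " + queueable(d));

    activate(d);
    resetAfterRestart(docs, true);
    System.out.println("with reprioritization, queueable: " + queueable(d));
  }
}
```

In this model a document that was active at shutdown stays pending forever unless the restart path gives it a new priority, which matches the symptom reported in the thread.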
>>
>> Karl
>>
>>
>> On Mon, Jun 4, 2018 at 12:35 PM Karl Wright <daddy...@gmail.com> wrote:
>>
>>> Hi Maxence,
>>>
>>> The docpriority values for these stuck documents show that they are
>>> "null":
>>>
>>>   public static final double noDocPriorityValue = 1e9;
>>>   public static final Double nullDocPriority =
>>>       new Double(noDocPriorityValue + 1.0);
>>>
>>> The document status is "G", which is STATUS_PENDINGPURGATORY, so the
>>> documents are awaiting being queued, which they will never be with a
>>> docpriority set to nullDocPriority.
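
As an illustration only, the check described above could be expressed as follows. The two constants are the ones quoted in the message; the `QueueRow` type, its field names, and the `findStuck` helper are hypothetical stand-ins for rows exported from the database, not ManifoldCF classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for spotting stuck documents in exported queue rows.
class StuckDocumentCheck {
  // The two constants quoted in the message above.
  static final double noDocPriorityValue = 1e9;
  static final Double nullDocPriority = Double.valueOf(noDocPriorityValue + 1.0);

  // Hypothetical in-memory copy of one queue row.
  static final class QueueRow {
    final String docId;
    final String status;
    final double docPriority;

    QueueRow(String docId, String status, double docPriority) {
      this.docId = docId;
      this.status = status;
      this.docPriority = docPriority;
    }
  }

  // Stuck: status "G" (pending purgatory) combined with the null priority marker.
  static boolean isStuck(QueueRow row) {
    return "G".equals(row.status)
        && row.docPriority >= nullDocPriority.doubleValue();
  }

  // Collects the ids of all rows that match the stuck condition.
  static List<String> findStuck(List<QueueRow> rows) {
    List<String> stuck = new ArrayList<>();
    for (QueueRow row : rows) {
      if (isStuck(row)) {
        stuck.add(row.docId);
      }
    }
    return stuck;
  }

  public static void main(String[] args) {
    List<QueueRow> rows = new ArrayList<>();
    rows.add(new QueueRow("doc-a", "G", 12.5));            // has a real priority
    rows.add(new QueueRow("doc-b", "G", nullDocPriority)); // carries the null marker
    System.out.println(findStuck(rows)); // only doc-b is reported
  }
}
```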
>>>
>>> It isn't supposed to be possible for a document to wind up in this
>>> state.  Documents that are pending are always supposed to have a
>>> document priority set.  I will need to review the code to see how this
>>> could happen.
>>>
>>> It is also possible that you're seeing a database bug.  I presume that
>>> you are running PostgreSQL?
>>>
>>> Karl
>>>
>>>
>>> On Mon, Jun 4, 2018 at 8:43 AM msaunier <msaun...@citya.com> wrote:
>>>
>>>> Thanks for your answers.
>>>>
>>>>
>>>>
>>>> I have attached a screenshot of the interface and the CSV result to
>>>> this email.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Maxence
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>>>> *Sent:* Monday, June 4, 2018 11:36
>>>> *To:* user@manifoldcf.apache.org
>>>> *Subject:* Re: Documents blocked sometimes without errors
>>>>
>>>>
>>>>
>>>> Oh, and it should be unnecessary to pause/resume jobs when you bring
>>>> down ManifoldCF for database maintenance.  Stop the agents service, and
>>>> start it again, and you should pick up exactly where you left off.
>>>>
>>>>
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jun 4, 2018 at 5:33 AM Karl Wright <daddy...@gmail.com> wrote:
>>>>
>>>> Hi Maxence,
>>>>
>>>>
>>>>
>>>> Pausing and restarting a job causes the docpriority field of all of its
>>>> documents to be recalculated.  It should not be necessary to do this in
>>>> order for the job to complete, though.
>>>>
>>>>
>>>>
>>>> All documents that are queued have their docpriority set at the time
>>>> they are added to the queue, but the docpriority they are given depends on
>>>> how many documents in the same document bin have already been given
>>>> docpriority values.  This is done to make sure documents from all bins are
>>>> given an equal chance of being crawled.  But since documents are given a
>>>> docpriority when queued, there may well have been plenty of other documents
>>>> "in front" of them that are already queued and must be processed before
>>>> there's any chance of getting crawled.  So it is possible that documents
>>>> from one job may appear to block documents from another -- but this will
>>>> eventually correct itself and those documents will be crawled.
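
A deliberately simplified sketch of the bin-based idea described above. It assumes that a lower docpriority value means "crawled sooner" and that the priority is just the number of documents already prioritized in the same bin; the real ManifoldCF calculation is more involved, and the class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of per-bin priority assignment.
class BinPrioritySketch {
  // Number of documents that have already been given a priority, per bin.
  private final Map<String, Integer> assignedPerBin = new HashMap<>();

  // Returns a priority equal to how many documents in this bin were
  // prioritized before this one (assumption: lower value = crawled sooner).
  double assignPriority(String bin) {
    int alreadyAssigned = assignedPerBin.merge(bin, 1, Integer::sum) - 1;
    return alreadyAssigned;
  }

  public static void main(String[] args) {
    BinPrioritySketch sketch = new BinPrioritySketch();
    // Three documents from one bin are queued first...
    System.out.println("a.example #1 -> " + sketch.assignPriority("a.example"));
    System.out.println("a.example #2 -> " + sketch.assignPriority("a.example"));
    System.out.println("a.example #3 -> " + sketch.assignPriority("a.example"));
    // ...then the first document from another bin arrives.
    System.out.println("b.example #1 -> " + sketch.assignPriority("b.example"));
  }
}
```

Under these assumptions the first document from `b.example` gets the same priority as the first document from `a.example`, so it sorts ahead of the later `a.example` documents even though it was queued after them. Documents deep in a busy bin, by contrast, have many others "in front" of them.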
>>>>
>>>> If you see *no* activity at all, however, then I wonder if somehow
>>>> documents have been queued with a null docpriority.  You can test this by
>>>> looking at the Document Status report and verifying that there is no reason
>>>> the documents should not be crawlable, and then looking in the database to
>>>> see what they have for their docpriority field.  Please let me know what
>>>> you find.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jun 4, 2018 at 4:20 AM msaunier <msaun...@citya.com> wrote:
>>>>
>>>> Hello Karl,
>>>>
>>>>
>>>>
>>>> Sometimes jobs are blocked on many documents, and I don't know why
>>>> because I see no errors. To unblock a job, I pause and resume it, and
>>>> then it works again. This does not happen every time, and the blocked
>>>> documents are never the same ones.
>>>>
>>>>
>>>>
>>>> We have a script that runs at 8:55 PM, and it is possibly the cause of
>>>> this problem. We created this script to avoid errors, because the SCO
>>>> servers are rebooted at 9:00 PM and ManifoldCF reports an error if those
>>>> servers are stopped.
>>>>
>>>>
>>>>
>>>> Script explanation:
>>>>
>>>>
>>>>
>>>> 1.       Pause the current job at 8:55 PM
>>>>
>>>> 2.       Stop ManifoldCF and wait for it to shut down
>>>>
>>>> 3.       Run VACUUM FULL on Postgres
>>>>
>>>> 4.       Run REINDEX on Postgres
>>>>
>>>> 5.       (Wait until 9:05 PM)
>>>>
>>>> 6.       Start ManifoldCF
>>>>
>>>> 7.       Wait for ManifoldCF to come up
>>>>
>>>> 8.       Resume the job
>>>>
>>>>
>>>>
>>>> Do you have any idea how to resolve this problem? Is the REINDEX or the
>>>> VACUUM FULL the cause?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Maxence
>>>>
>>>>
>>>>
>>>>
