Re: Documents blocked sometimes without errors

Karl Wright Mon, 18 Jun 2018 01:02:37 -0700

The only way to know if these are truly blocked is to find the document
records in the database and include them here.


Thanks,
Karl


On Mon, Jun 18, 2018 at 3:55 AM msaunier <msaun...@citya.com> wrote:

> Hello Karl,
>
>
>
> Today, I have 2 documents blocked on the new trunk version (I think). Can
> I verify my trunk version after the build?
>
>
>
> Thanks,
>
> Maxence
>
>
>
>
>
> *From:* msaunier [mailto:msaun...@citya.com]
> *Sent:* Tuesday, June 5, 2018 14:54
> *To:* 'user@manifoldcf.apache.org' <user@manifoldcf.apache.org>
> *Subject:* RE: Documents blocked sometimes without errors
>
>
>
> OK. I have built and deployed it.
>
>
>
> The tests are in progress.
>
>
>
> Thanks,
>
> Maxence
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Monday, June 4, 2018 19:55
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Documents blocked sometimes without errors
>
>
>
> I attached a patch to the ticket that is a tentative fix.  Please let me
> know if you still see this problem after applying it.  Thanks!
>
>
>
> Karl
>
>
>
>
>
> On Mon, Jun 4, 2018 at 12:56 PM Karl Wright <daddy...@gmail.com> wrote:
>
> CONNECTORS-1507 created.
>
> Karl
>
>
>
>
>
> On Mon, Jun 4, 2018 at 12:51 PM Karl Wright <daddy...@gmail.com> wrote:
>
> I think I found the issue.
>
> Basically, when the agents process is restarted, it doesn't reprioritize
> the documents that were active when the service was brought down, but it
> should because when the documents became active they lost their document
> priority.  This should be trivial to fix.
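The failure mode described above can be illustrated with a small Java sketch. It is only a toy model under stated assumptions: the class, enum, and method names are hypothetical and are not ManifoldCF code, and the only value taken from this thread is the 1e9 + 1.0 "null" priority quoted in an earlier message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the restart problem; not ManifoldCF code.
final class RestartReprioritizeSketch {
  // "Null" priority value quoted in this thread (noDocPriorityValue + 1.0).
  static final double NULL_DOC_PRIORITY = 1e9 + 1.0;

  enum State { PENDING, ACTIVE }

  static final class Doc {
    final String id;
    State state = State.PENDING;
    double docPriority;

    Doc(String id, double docPriority) {
      this.id = id;
      this.docPriority = docPriority;
    }
  }

  // When a document becomes active it loses its document priority.
  static void activate(Doc doc) {
    doc.state = State.ACTIVE;
    doc.docPriority = NULL_DOC_PRIORITY;
  }

  // On restart, documents that were active go back to pending.
  // With reprioritize == false the priority stays "null" (the bug);
  // with reprioritize == true each such document gets a fresh priority.
  static void resetOnRestart(List<Doc> docs, boolean reprioritize) {
    double next = 0.0; // assumed numbering scheme, for illustration only
    for (Doc doc : docs) {
      if (doc.state == State.ACTIVE) {
        doc.state = State.PENDING;
        if (reprioritize) {
          doc.docPriority = next;
          next += 1.0;
        }
      }
    }
  }

  // A pending document can only be queued if it has a real priority.
  static boolean isQueueable(Doc doc) {
    return doc.state == State.PENDING && doc.docPriority < NULL_DOC_PRIORITY;
  }

  public static void main(String[] args) {
    List<Doc> docs = new ArrayList<>();
    docs.add(new Doc("doc-1", 0.0));

    activate(docs.get(0));
    resetOnRestart(docs, false);
    System.out.println("without reprioritization: " + isQueueable(docs.get(0)));

    activate(docs.get(0));
    resetOnRestart(docs, true);
    System.out.println("with reprioritization: " + isQueueable(docs.get(0)));
  }
}
```

In this model, a document that was active at shutdown stays unqueueable unless the restart step also assigns it a new priority.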
>
>
>
> Karl
>
>
>
>
>
> On Mon, Jun 4, 2018 at 12:35 PM Karl Wright <daddy...@gmail.com> wrote:
>
> Hi Maxence,
>
>
>
> The docpriority values for these stuck documents show that they are "null":
>
>
>
>   public static final double noDocPriorityValue = 1e9;
>
>   public static final Double nullDocPriority = new Double(noDocPriorityValue + 1.0);
>
>
>
> The document status is "G", which is STATUS_PENDINGPURGATORY, so the
> documents are awaiting being queued, which they will never be with a
> docpriority set to nullDocPriority.
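A minimal Java sketch of this check, applied to rows exported from the database (for example the CSV mentioned in this thread), might look like the following. The Row type and its field names are hypothetical; only the "G" status code and the 1e9 + 1.0 value come from the message above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for spotting stuck documents in exported rows.
final class StuckDocumentCheck {
  // Values quoted in the message above.
  static final double NO_DOC_PRIORITY_VALUE = 1e9;
  static final double NULL_DOC_PRIORITY = NO_DOC_PRIORITY_VALUE + 1.0;
  // Status code "G", identified above as STATUS_PENDINGPURGATORY.
  static final String STATUS_PENDINGPURGATORY = "G";

  // Assumed shape of one exported row; field names are illustrative.
  static final class Row {
    final String docId;
    final String status;
    final double docPriority;

    Row(String docId, String status, double docPriority) {
      this.docId = docId;
      this.status = status;
      this.docPriority = docPriority;
    }
  }

  // A row is "stuck" when it awaits queuing but carries the null priority.
  static boolean isStuck(Row row) {
    return STATUS_PENDINGPURGATORY.equals(row.status)
        && Double.compare(row.docPriority, NULL_DOC_PRIORITY) == 0;
  }

  // Collect the ids of all stuck rows, preserving input order.
  static List<String> findStuck(List<Row> rows) {
    List<String> stuck = new ArrayList<>();
    for (Row row : rows) {
      if (isStuck(row)) {
        stuck.add(row.docId);
      }
    }
    return stuck;
  }

  public static void main(String[] args) {
    List<Row> rows = new ArrayList<>();
    rows.add(new Row("doc-a", "G", NULL_DOC_PRIORITY)); // stuck
    rows.add(new Row("doc-b", "G", 42.0));              // pending with a real priority
    System.out.println("stuck documents: " + findStuck(rows));
  }
}
```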
>
>
>
> It isn't supposed to be possible for a document to wind up in this state.
> Documents that are pending are always supposed to set a document priority.
> I will need to review the code to see how this could happen.
>
> It is also possible that you're seeing a database bug.  I presume that you
> are running Postgresql?
>
>
>
> Karl
>
>
>
>
>
> On Mon, Jun 4, 2018 at 8:43 AM msaunier <msaun...@citya.com> wrote:
>
> Thanks for your answers.
>
>
>
> I have attached a screenshot of the interface and the CSV result to this email.
>
>
>
> Thanks,
>
> Maxence
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Monday, June 4, 2018 11:36
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Documents blocked sometimes without errors
>
>
>
> Oh, and it should be unnecessary to pause/resume jobs when you bring down
> ManifoldCF for database maintenance.  Stop the agents service, and start it
> again, and you should pick up exactly where you left off.
>
>
>
> Karl
>
>
>
>
>
> On Mon, Jun 4, 2018 at 5:33 AM Karl Wright <daddy...@gmail.com> wrote:
>
> Hi Maxence,
>
>
>
> Pausing and restarting a job causes the docpriority field of all of its
> documents to be recalculated.  It should not be necessary to do this in
> order for the job to complete, though.
>
>
>
> All documents that are queued have their docpriority set at the time they
> are added to the queue, but the docpriority they are given depends on how
> many documents in the same document bin have already been given
> docpriority values.  This is done to make sure documents from all bins are
> given an equal chance of being crawled.  But since documents are given a
> docpriority when queued, there may well have been plenty of other documents
> "in front" of them that are already queued and must be processed before
> there's any chance of getting crawled.  So it is possible that documents
> from one job may appear to block documents from another -- but this will
> eventually correct itself and those documents will be crawled.
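The per-bin idea can be sketched in Java as follows. This is a deliberately simplified model, not ManifoldCF's actual priority formula: it assumes that a lower docpriority is crawled first and that a document's priority is simply the number of documents in its bin that were prioritized before it. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of per-bin priority assignment.
final class BinPrioritySketch {
  // Number of documents already given a priority, per bin.
  private final Map<String, Integer> assignedPerBin = new HashMap<>();

  // Returns the priority for the next document queued in the given bin.
  // The first document of every bin gets 0.0, the second 1.0, and so on,
  // so a bin with few documents is not starved by a bin with many.
  double nextPriority(String bin) {
    int countIncludingThis = assignedPerBin.merge(bin, 1, Integer::sum);
    return countIncludingThis - 1;
  }

  public static void main(String[] args) {
    BinPrioritySketch sketch = new BinPrioritySketch();
    String[] bins = {"server-a", "server-a", "server-a", "server-b"};
    for (String bin : bins) {
      System.out.println(bin + " -> " + sketch.nextPriority(bin));
    }
  }
}
```

Under these assumptions, the single document from the second bin receives the same priority as the first document from the busy bin, so it does not wait behind the entire backlog, while later documents in a busy bin do wait behind everything already queued ahead of them.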
>
> If you see *no* activity at all, however, then I wonder if somehow
> documents have been queued with a null docpriority.  You can test this by
> looking at the Document Status report and verifying that there is no reason
> the documents should not be crawlable, and then looking in the database to
> see what they have for their docpriority field.  Please let me know what
> you find.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Mon, Jun 4, 2018 at 4:20 AM msaunier <msaun...@citya.com> wrote:
>
> Hello Karl,
>
>
>
> Sometimes jobs are blocked on many documents, and I don’t know why because
> there are no errors.  To unblock them, I pause and resume the job, and then
> it works.  This does not always happen, and it is never the same documents.
>
>
>
> We have a script that runs at 8:55 PM, and it is possibly the cause of this
> problem.  We created this script to avoid errors, because the SCO servers are
> rebooted at 9:00 PM and ManifoldCF reports an error if those servers are stopped.
>
>
>
> Script explanation:
>
>
>
> 1. Pause the current job at 8:55 PM
>
> 2. Stop ManifoldCF and wait for it to shut down
>
> 3. Run VACUUM FULL on Postgres
>
> 4. Run REINDEX on Postgres
>
> 5. Wait until 9:05 PM
>
> 6. Start ManifoldCF
>
> 7. Wait for ManifoldCF to come up
>
> 8. Resume the job
>
>
>
> Do you have any idea how to resolve this problem?  Is the REINDEX or the
> VACUUM FULL the cause?
>
>
>
> Thanks,
>
> Maxence
>
>
>
>
