Ok, so pick ONE of these identifiers.

What I want to see is the entire lifecycle of the ONE identifier.  That
includes what the Web Connection logs as well as what the indexing side logs.
Ideally I'd like to see:

- job start and end
- web connection events
- indexing events

I'd like to see these both for the job run that indexes the document
initially and for the job run that deletes it.
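
For concreteness, here is roughly how I would pull every Simple History row
for the ONE identifier straight out of PostgreSQL.  Treat this as a sketch:
the table and column names (repohistory, entityid, activitytype, resultcode,
resultdesc) are from memory, and the connection details are placeholders, so
check both against your installation.  The Simple History report in the UI,
filtered to the same identifier, should give you the same picture.

import psycopg2  # assumes the psycopg2 driver is installed

# The ONE identifier you picked (hypothetical URL)
DOC_ID = "http://intranet.example.com/some/page.html"

# Placeholder connection details -- substitute your own
conn = psycopg2.connect(dbname="manifoldcf", user="manifoldcf",
                        password="secret", host="localhost")
cur = conn.cursor()
cur.execute(
    """SELECT starttime, activitytype, resultcode, resultdesc
         FROM repohistory
        WHERE entityid = %s
        ORDER BY starttime""",
    (DOC_ID,),
)
for row in cur.fetchall():
    print(row)  # starttime is milliseconds since epoch, if memory serves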

My suspicion is that on the second run the document is simply no longer
reachable from the seeds.  In other words, the seed documents either cannot
be fetched on the second run or they contain different stuff and there's no
longer a chain of links between the seeds and the documents being deleted.
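
To illustrate what I mean by "no longer a chain of links", here is a toy
sketch (all URLs and link data invented) of how reachability from the seeds
determines what survives a run:

from collections import deque

def reachable(seeds, links):
    # links maps each URL to the list of URLs it links to
    seen, queue = set(seeds), deque(seeds)
    while queue:
        url = queue.popleft()
        for nxt in links.get(url, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

run1 = {"seed": ["a"], "a": ["b"]}  # first run: seed -> a -> b
run2 = {"seed": ["a"], "a": []}     # second run: the a -> b link is gone
print(reachable(["seed"], run1) - reachable(["seed"], run2))  # {'b'}

Anything in that difference is exactly what an incremental run would end up
deleting.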

Thanks,
Karl


On Wed, Oct 30, 2019 at 1:50 AM Priya Arora <pr...@smartshore.nl> wrote:

> The indexation screenshot is shown below.
>
> [image: image.png]
>
> On Tue, Oct 29, 2019 at 7:57 PM Karl Wright <daddy...@gmail.com> wrote:
>
>> I need both ingestion and deletion.
>> Karl
>>
>>
>> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <pr...@smartshore.nl> wrote:
>>
>>> The history is shown below; it does not indicate any error.
>>> [image: 12.JPG]
>>>
>>> Thanks
>>> Priya
>>>
>>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> What does the history say about these documents?
>>>> Karl
>>>>
>>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <pr...@smartshore.nl>
>>>> wrote:
>>>>
>>>>>
>>>>> "it may be that (a) they weren't found, or (b) that the document
>>>>> specification in the job changed and they are no longer included in the
>>>>> job."
>>>>>
>>>>> The URLs that were deleted are valid URLs (they do not return 404 or
>>>>> page-not-found errors), and they are not listed in the Exclusions tab
>>>>> of the job configuration.
>>>>> The URLs were getting indexed earlier, and apart from the index name in
>>>>> Elasticsearch nothing has changed in the job specification or in the
>>>>> other connectors.
>>>>>
>>>>> Thanks
>>>>> Priya
>>>>>
>>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddy...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> ManifoldCF is an incremental crawler, which means that on every
>>>>>> (non-continuous) job run it sees which documents it can find and removes
>>>>>> the ones it can't.  The history for the documents being deleted should 
>>>>>> tell
>>>>>> you why they are being deleted -- it may be that (a) they weren't found, 
>>>>>> or
>>>>>> (b) that the document specification in the job changed and they are no
>>>>>> longer included in the job.
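>>>>>>
>>>>>> In Python terms (a conceptual sketch only, not the actual
>>>>>> implementation), the end-of-run cleanup amounts to a set difference:
>>>>>>
>>>>>> previously_indexed = {"a", "b", "c"}  # what the last run indexed
>>>>>> found_this_run = {"a", "c"}           # what this run could find
>>>>>> to_delete = previously_indexed - found_this_run  # {"b"} is removed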
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <pr...@smartshore.nl>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have a query regarding the ManifoldCF job process. I have a job to
>>>>>>> crawl an intranet site.
>>>>>>> Repository type: Web
>>>>>>> Output connector type: Elasticsearch
>>>>>>>
>>>>>>> The job has to crawl around 4-5 lakhs (400,000-500,000) of records in
>>>>>>> total. I discarded the previous index and created a new index (in
>>>>>>> Elasticsearch) with proper mappings and settings, and started the job
>>>>>>> again after even cleaning the database (the database used is
>>>>>>> PostgreSQL).
>>>>>>> While the job runs it ingests the records properly, but just before
>>>>>>> finishing (sometimes in between as well) it initiates the deletion
>>>>>>> process, and it does not index the deleted documents again.
>>>>>>>
>>>>>>> Can you please suggest whether I am doing something wrong? Or is this
>>>>>>> part of the ManifoldCF process, and if so, why are the documents not
>>>>>>> getting ingested again?
>>>>>>>
>>>>>>> Thanks and regards
>>>>>>> Priya
>>>>>>>
>>>>>>>
