Re: Manifoldcf - Job Deletion Process

2019-10-29 Thread Priya Arora
The indexing screenshot is below.

[image: image.png]

On Tue, Oct 29, 2019 at 7:57 PM Karl Wright  wrote:

> I need both ingestion and deletion.
> Karl


Re: Manifoldcf - Job Deletion Process

2019-10-29 Thread Karl Wright
I need both ingestion and deletion.
Karl


On Tue, Oct 29, 2019 at 8:09 AM Priya Arora  wrote:

> The history is shown below; it does not indicate any error.
> [image: 12.JPG]
>
> Thanks
> Priya


Re: Manifoldcf - Job Deletion Process

2019-10-29 Thread Priya Arora
The history is shown below; it does not indicate any error.
[image: 12.JPG]

Thanks
Priya

On Tue, Oct 29, 2019 at 5:02 PM Karl Wright  wrote:

> What does the history say about these documents?
> Karl


Re: Manifoldcf - Job Deletion Process

2019-10-29 Thread Karl Wright
What does the history say about these documents?
Karl

On Tue, Oct 29, 2019 at 6:53 AM Priya Arora  wrote:

>
>  it may be that (a) they weren't found, or (b) that the document
> specification in the job changed and they are no longer included in the job.
>
> The URLs that were deleted are valid (they do not return a 404 or
> page-not-found error), and they are not listed in the Exclusion tab of
> the job configuration.
> The URLs were being indexed earlier, and apart from the index name in
> Elasticsearch, nothing has changed in the job specification or in the
> other connectors.
>
> Thanks
> Priya
>
> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright  wrote:
>
>> ManifoldCF is an incremental crawler, which means that on every
>> (non-continuous) job run it sees which documents it can find and removes
>> the ones it can't.  The history for the documents being deleted should tell
>> you why they are being deleted -- it may be that (a) they weren't found, or
>> (b) that the document specification in the job changed and they are no
>> longer included in the job.
>>
>> Karl
>>
>>
>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora  wrote:
>>
>>> Hi All,
>>>
>>> I have a question about the ManifoldCF job process. I have a job that
>>> crawls an intranet site:
>>> Repository Type: Web
>>> Output Connector Type: Elasticsearch
>>>
>>> The job has to crawl around 4-5 lakh records in total. I discarded the
>>> previous index and created a new index (in Elasticsearch) with proper
>>> mappings and settings, and started the job again after also cleaning
>>> the database (the database used is PostgreSQL).
>>> While the job runs it ingests the records properly, but just before
>>> finishing (sometimes partway through as well) it starts deleting
>>> documents, and it does not index the deleted documents again.
>>>
>>> Can you please tell me whether I am doing anything wrong? Or is this
>>> normal ManifoldCF behavior, and if so, why are the documents not
>>> ingested again?
>>>
>>> Thanks and regards
>>> Priya
>>>
>>>
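
The incremental-crawl behavior Karl describes in this thread (each non-continuous run keeps what it can find and removes what it cannot) can be illustrated with a small sketch. This is not ManifoldCF code: the function name plan_incremental_run, its parameters, and the example URLs are all hypothetical, and the real crawler tracks much more state. It only shows how the two causes Karl lists, (a) not found and (b) no longer included in the job specification, both lead to deletion.

```python
from typing import Callable, Dict, Set


def plan_incremental_run(
    previously_indexed: Set[str],
    reachable_now: Set[str],
    in_specification: Callable[[str], bool],
) -> Dict[str, Set[str]]:
    """Hypothetical model of one incremental job run (not ManifoldCF's API)."""
    # A document survives the run only if the crawl found it AND the job's
    # document specification still includes it.
    keep = {url for url in reachable_now if in_specification(url)}
    # Anything indexed by an earlier run that did not survive is removed:
    # case (a) it was not found, or case (b) it is no longer in the job.
    delete = previously_indexed - keep
    return {"keep": keep, "delete": delete}


if __name__ == "__main__":
    # Made-up URLs purely for illustration.
    earlier = {"http://intranet/a", "http://intranet/b", "http://intranet/c"}
    found = {"http://intranet/a", "http://intranet/c", "http://intranet/d"}
    # Pretend the specification now excludes anything ending in "/c".
    plan = plan_incremental_run(earlier, found, lambda u: not u.endswith("/c"))
    print(sorted(plan["delete"]))
```

In this model "/b" is deleted because the run did not find it, and "/c" is deleted because the specification no longer includes it, which is why the per-document history is the place to look for the actual reason.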