Re: Manifoldcf - Job Deletion Process

2019-11-05 Thread Priya Arora
When I created a new job and followed the lifecycle/execution of the
identifier, it did not start the deletion process. There was no change in
the job configuration, nor in the database setup and configuration.

On Sat, Nov 2, 2019 at 12:41 AM Priya Arora  wrote:

> No, I am not deleting the job after it runs; its status is updated to
> ‘Done’ once all processing finishes.
> The process does involve indexing the documents, and just before the
> job ends the deletion process executes.
> The sequence is: fetch etc., indexation, extraction and other
> processes, deletion, then job done.
>
> Sent from my iPhone
>
> On 01-Nov-2019, at 8:42 PM, Karl Wright  wrote:
>
> 
> So Priya, one thing is not clear to me: are you *deleting* the job after
> it runs?
> Because if you are, all documents indexed by that job will be deleted as
> well.
> You need to leave the job around and not delete it unless you want the
> documents to go away that the job indexed.
>
> Karl
>
>
> On Fri, Nov 1, 2019 at 6:51 AM Karl Wright  wrote:
>
>> There is a "Hop filters" tab in the job.  This allows you to specify the
>> maximum number of hops from the seed documents that are allowed.  Or you
>> can turn it off entirely, if you do not want this feature.
>>
>> Bear in mind that documents that are unreachable by *any* means from the
>> seed documents will always be deleted at the end of each job run.  So if
>> you are relying on some special page you generate to point at all the
>> documents you want to crawl, make sure it has a complete list.  If you try
>> to make an incremental list of just the new documents, then all the old
>> ones will get removed.
>>
>> Karl
>>
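For what Karl describes above, a sketch of a generator that keeps the seed page complete may help. It is hypothetical, not ManifoldCF code; the document root, site URL, and output path are placeholders:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    // Hypothetical sketch, not ManifoldCF code: regenerate the seed page
    // with EVERY crawlable document on each run. If this list were
    // incremental (new documents only), everything missing from it would
    // become unreachable and be deleted at the end of the next job run.
    public class SeedPageGenerator {
        public static void main(String[] args) throws IOException {
            Path docRoot = Paths.get("/var/www/docs");   // assumed document root
            try (Stream<Path> files = Files.walk(docRoot)) {
                String links = files.filter(Files::isRegularFile)
                    .map(p -> "<a href=\"https://intranet.example.com/"
                            + docRoot.relativize(p) + "\">" + p.getFileName() + "</a>")
                    .collect(Collectors.joining("\n"));
                Files.write(Paths.get("/var/www/seed.html"),
                    ("<html><body>\n" + links + "\n</body></html>").getBytes());
            }
        }
    }

Rebuilding the full list on every run, instead of appending only new documents, keeps every previously indexed document reachable from the seeds.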
>>
>> On Fri, Nov 1, 2019 at 6:41 AM Priya Arora  wrote:
>>
>>> Yes, I have set up authentication properly, as we have configured this
>>> setting by passing the info in the header.
>>>
>>> (1) They are now unreachable, whereas they were reachable before by the
>>> specified number of hops from the seed documents; -- but if I compare
>>> with the previous index, where the data is not much older (about a
>>> week), the deleted documents had been ingested, and when I check them
>>> now they do not result in 404.
>>> Regarding "the specified number of hops from the seed documents": can
>>> you please elaborate on that a little?
>>>
>>> Thanks
>>> Priya
>>>
>>> On Fri, Nov 1, 2019 at 3:43 PM Karl Wright  wrote:
>>>
 Hi Priya,

 ManifoldCF doesn't delete documents unless:
 (1) They are now unreachable, whereas they were reachable before by the
 specified number of hops from the seed documents;
 (2) They cannot be fetched due to a 404 error, or something similar
 which tells ManifoldCF that they are not available.

 Your site, I notice, has a "sso" page.  Are you setting up session
 authentication properly?

 Karl
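A quick way to check whether fetches are hitting the SSO wall is to request one of the crawled pages without a session and inspect the response. A minimal probe with a placeholder URL; a 302 pointing at the sso page would mean an unauthenticated fetch gets the login flow rather than the document (and rather than a 404):

    import java.net.URI;
    import java.net.http.*;

    // Hypothetical check: request a crawled page with no session cookie.
    // A 302/303 redirecting to the "sso" page means an unauthenticated
    // fetch sees the login flow, not the document. The URL is a placeholder.
    public class SsoProbe {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NEVER)  // surface the redirect itself
                .build();
            HttpRequest req = HttpRequest.newBuilder(
                URI.create("https://intranet.example.com/some/page")).GET().build();
            HttpResponse<Void> resp = client.send(req, HttpResponse.BodyHandlers.discarding());
            System.out.println(resp.statusCode() + " -> "
                + resp.headers().firstValue("Location").orElse("(no redirect)"));
        }
    }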


 On Fri, Nov 1, 2019 at 3:59 AM Priya Arora  wrote:

> 
>
>
> Screenshot of the deleted documents other than PDFs:
>
> On Fri, Nov 1, 2019 at 1:28 PM Priya Arora 
> wrote:
>
>> The job was started as per the below schedule:
>> 
>>
>>
>> Just before the completion of the job, it started the deletion
>> process. Before starting the job, a new index in ES was created and
>> the database was cleaned up.
>> 
>>
>>
>> Records were processed and indexed successfully. When I check these
>> URLs (the deleted ones) in a browser, they appear to be valid and are
>> accessible.
>> The job is to crawl around 2.25 lakh (225,000) records, so the seed
>> URL has many sub-links within it. If the theory is that the crawler
>> deletes the URLs because they were already present in the database,
>> that should not be the case, as the database cleanup was done before
>> the run.
>>
>> If the thought is that the crawler deletes only documents with a PDF
>> extension, that is not the case either, as HTML pages are also deleted.
>>
>> Can you please suggest something on this?
>>
>> Thanks
>> Priya
>>
>> On Wed, Oct 30, 2019 at 3:39 PM Karl Wright 
>> wrote:
>>
>>> So it looks like the URL ending in 117047 was successfully processed
>>> and indexed, and not removed.  The URLs ending in 119200 and lang-en
>>> were both unreachable and were removed.  I don't see a job end at all?
>>> There's a new job start at 12:39 though.
>>>
>>> What I want to see is the lifetime of one of the documents that you
>>> think is getting removed for no reason.
>>>
>>> Karl
>>>
>>>
>>> On Wed, Oct 30, 2019 at 3:13 AM Priya Arora 
>>> wrote:
>>>
 You want to test the job's whole process (start, end, and all
 events from History) by seeding one of these URLs. Below are the results:
 I changed the seed URL to the picked identifier; that document was
 then fetched and indexed in a new index, and the deletion process started.


Re: Manifoldcf - Job Deletion Process

2019-10-30 Thread Karl Wright
Ok, so pick ONE of these identifiers.

What I want to see is the entire lifecycle of the ONE identifier.  That
includes what the Web Connection logs as well as what the indexation logs.
Ideally I'd like to see:

- job start and end
- web connection events
- indexing events

I'd like to see these for both the job that indexes the document initially
as well as the job run that deletes the document.

My suspicion is that on the second run the document is simply no longer
reachable from the seeds.  In other words, the seed documents either cannot
be fetched on the second run or they contain different stuff and there's no
longer a chain of links between the seeds and the documents being deleted.

Thanks,
Karl
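One way to pull that lifecycle together is to query the crawler database directly. The sketch below is hedged: the table and column names (repohistory, entityid, activitytype, starttime, resultcode, resultdesc) are assumptions about the default PostgreSQL schema and should be verified against your installation, and the JDBC URL and credentials are placeholders. The Simple History report in the UI surfaces the same events.

    import java.sql.*;

    // Hedged sketch: trace every history event for ONE document identifier.
    // Table/column names are assumptions about ManifoldCF's default
    // PostgreSQL schema (verify against your install); URL and credentials
    // are placeholders. Requires the PostgreSQL JDBC driver on the classpath.
    public class DocumentLifecycle {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:postgresql://localhost:5432/dbname";
            String identifier = "https://intranet.example.com/some/page"; // the ONE identifier
            try (Connection c = DriverManager.getConnection(url, "manifoldcf", "password");
                 PreparedStatement ps = c.prepareStatement(
                     "SELECT starttime, activitytype, resultcode, resultdesc "
                     + "FROM repohistory WHERE entityid = ? ORDER BY starttime")) {
                ps.setString(1, identifier);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // One line per event: fetch, ingest, delete, etc.
                        System.out.printf("%d %s %s %s%n",
                            rs.getLong("starttime"), rs.getString("activitytype"),
                            rs.getString("resultcode"), rs.getString("resultdesc"));
                    }
                }
            }
        }
    }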


On Wed, Oct 30, 2019 at 1:50 AM Priya Arora  wrote:

> The indexation screenshot is below.
>
> [image: image.png]
>
> On Tue, Oct 29, 2019 at 7:57 PM Karl Wright  wrote:
>
>> I need both ingestion and deletion.
>> Karl
>>
>>
>> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora  wrote:
>>
>>> History is shown below; it does not indicate any error.
>>> [image: 12.JPG]
>>>
>>> Thanks
>>> Priya
>>>
>>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright  wrote:
>>>
 What does the history say about these documents?
 Karl

 On Tue, Oct 29, 2019 at 6:53 AM Priya Arora 
 wrote:

>
>  it may be that (a) they weren't found, or (b) that the document
> specification in the job changed and they are no longer included in the 
> job.
>
> The URLs that were deleted are valid URLs (they do not result in a 404
> or page-not-found error), and they are not mentioned in the Exclusion
> tab of the job configuration.
> And the URLs were getting indexed earlier; except for the index name
> in Elasticsearch, nothing has changed in the job specification or in
> the other connectors.
>
> Thanks
> Priya
>
> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright 
> wrote:
>
>> ManifoldCF is an incremental crawler, which means that on every
>> (non-continuous) job run it sees which documents it can find and removes
>> the ones it can't.  The history for the documents being deleted should
>> tell you why they are being deleted -- it may be that (a) they weren't
>> found, or (b) that the document specification in the job changed and
>> they are no longer included in the job.
>>
>> Karl
>>
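The behavior described above can be modeled in a few lines. This is a simplified illustration, not ManifoldCF's actual implementation: crawl outward from the seeds, then delete whatever was indexed previously but was not reached on this run. A seed page that shrinks, or a seed that fails to fetch, therefore produces a wave of deletions at the end of the job.

    import java.util.*;
    import java.util.function.Function;

    // Simplified model of a non-continuous incremental crawl (not actual
    // ManifoldCF code). Documents reachable from the seeds are (re)indexed;
    // previously indexed documents NOT reached on this run are deleted at
    // the end of the run.
    public class IncrementalCrawlModel {
        static Set<String> crawl(Set<String> seeds,
                                 Function<String, List<String>> fetchLinks,
                                 Set<String> previouslyIndexed) {
            Set<String> reached = new HashSet<>();
            Deque<String> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                String url = queue.poll();
                if (!reached.add(url)) continue;           // already visited
                for (String link : fetchLinks.apply(url))  // fetch + extract links
                    queue.add(link);
            }
            Set<String> toDelete = new HashSet<>(previouslyIndexed);
            toDelete.removeAll(reached);                   // unreachable this run
            System.out.println("deleting: " + toDelete);
            return reached;                                // the new index contents
        }
    }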
>>
>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have a query regarding the ManifoldCF job process. I have a job to
>>> crawl an intranet site.
>>> Repository Type: Web
>>> Output Connector Type: Elasticsearch
>>>
>>> The job has to crawl around 4-5 lakh (400,000-500,000) records in
>>> total. I discarded the previous index and created a new index (in
>>> Elasticsearch) with proper mappings and settings, and started the job
>>> again after even cleaning the database (the database used is
>>> PostgreSQL).
>>> But while the job runs it ingests the records properly; then, just
>>> before finishing (sometimes in between as well), it initiates the
>>> deletion process, and it does not index the deleted documents again.
>>>
>>> Can you please suggest whether I am doing anything wrong? Or is this
>>> normal ManifoldCF behavior, and if so, why are the documents not
>>> getting ingested again?
>>>
>>> Thanks and regards
>>> Priya
>>>
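For the "new index with proper mappings and settings" step mentioned above, here is a minimal sketch of recreating the index over plain HTTP; the host, index name, and field names are assumptions for the example, and real mappings depend on the documents being indexed.

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.*;

    // Minimal sketch: recreate an Elasticsearch index with explicit
    // mappings before a fresh crawl. Host, index name, and fields are
    // assumptions for the example, not values from the thread.
    public class CreateIndex {
        public static void main(String[] args) throws IOException, InterruptedException {
            String body = "{ \"mappings\": { \"properties\": {"
                        + "  \"content\": { \"type\": \"text\" },"
                        + "  \"url\":     { \"type\": \"keyword\" }"
                        + "} } }";
            HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/intranet-2019-11")) // assumed index name
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
            HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.statusCode() + " " + resp.body());
        }
    }

Creating the index with explicit mappings before the first ingestion avoids Elasticsearch inferring field types from whatever document happens to arrive first.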
>>>

