Re: Manifoldcf - Job Deletion Process
Indexation screenshot is as below.
[image: image.png]

On Tue, Oct 29, 2019 at 7:57 PM Karl Wright wrote:
> I need both ingestion and deletion.
> Karl
Re: Manifoldcf - Job Deletion Process
I need both ingestion and deletion.

Karl

On Tue, Oct 29, 2019 at 8:09 AM Priya Arora wrote:
> The history is shown below; it does not indicate any error.
> [image: 12.JPG]
>
> Thanks
> Priya
Re: Manifoldcf - Job Deletion Process
The history is shown below; it does not indicate any error.
[image: 12.JPG]

Thanks
Priya

On Tue, Oct 29, 2019 at 5:02 PM Karl Wright wrote:
> What does the history say about these documents?
> Karl
Re: Manifoldcf - Job Deletion Process
What does the history say about these documents?

Karl

On Tue, Oct 29, 2019 at 6:53 AM Priya Arora wrote:
>> it may be that (a) they weren't found, or (b) that the document
>> specification in the job changed and they are no longer included in the job.
>
> The URLs that were deleted are valid (they do not return a 404 or "page
> not found" error), and they are not listed in the Exclusion tab of the
> job configuration.
> These URLs were being indexed earlier, and apart from the index name in
> Elasticsearch, nothing has changed in the job specification or in the
> other connectors.
>
> Thanks
> Priya
>
> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright wrote:
>> ManifoldCF is an incremental crawler, which means that on every
>> (non-continuous) job run it sees which documents it can find and removes
>> the ones it can't. The history for the documents being deleted should tell
>> you why they are being deleted -- it may be that (a) they weren't found, or
>> (b) that the document specification in the job changed and they are no
>> longer included in the job.
>>
>> Karl
>>
>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora wrote:
>>> Hi All,
>>>
>>> I have a query regarding the ManifoldCF job process. I have a job that
>>> crawls an intranet site:
>>> Repository Type: Web
>>> Output Connector Type: Elasticsearch
>>>
>>> The job has to crawl around 4-5 lakh (400,000-500,000) records in total.
>>> I discarded the previous index and created a new index (in Elasticsearch)
>>> with proper mappings and settings, and started the job again after also
>>> cleaning the database (the database used is PostgreSQL).
>>> While the job runs, it ingests the records properly, but just before
>>> finishing (and sometimes partway through), it starts a deletion process,
>>> and it does not index the deleted documents again.
>>>
>>> Can you please tell me if I am doing anything wrong? Or is this part of
>>> ManifoldCF's normal process? If so, why are the documents not being
>>> ingested again?
>>>
>>> Thanks and regards
>>> Priya
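
For readers of this thread, the incremental behaviour Karl describes can be sketched roughly as below. This is only an illustration in Python, not ManifoldCF's actual code: the function name `reconcile`, its parameters, and the reason strings are all made up for this example. It assumes a job run ends with a set of URLs the crawler managed to find, and that the job's document specification can be expressed as a yes/no check per URL.

```python
def reconcile(previously_indexed, found_this_run, in_specification):
    """Hypothetical sketch of one non-continuous job run.

    previously_indexed: URLs indexed by earlier runs of the job
    found_this_run:     URLs the crawler could reach on this run
    in_specification:   callable returning True if the job's document
                        specification still includes the URL
    Returns (to_index, to_delete), where to_delete maps each removed
    URL to the reason it is being removed.
    """
    to_delete = {}
    for url in previously_indexed:
        if url not in found_this_run:
            # Case (a) from the thread: the document was not found.
            to_delete[url] = "not found"
        elif not in_specification(url):
            # Case (b): the specification no longer includes it.
            to_delete[url] = "excluded by specification"

    # Everything found on this run that the specification still allows
    # is (re)indexed.
    to_index = {url for url in found_this_run if in_specification(url)}
    return to_index, to_delete


if __name__ == "__main__":
    previous = {"http://intranet/a", "http://intranet/b", "http://intranet/c"}
    found = {"http://intranet/a", "http://intranet/c", "http://intranet/d"}
    allowed = lambda url: not url.endswith("/c")  # made-up specification

    to_index, to_delete = reconcile(previous, found, allowed)
    print(sorted(to_index))
    print(to_delete)
```

In this sketch a URL can be perfectly valid and still land in `to_delete`, simply because this particular run never reached it (for example, nothing that was crawled linked to it). That is why the per-document history is the place to look for the reason.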