When I created a new job and followed the lifecycle/execution of the identifier, it did not start the deletion process. There was no change in the job configuration or in the database setup and configuration.
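Since the question in the thread below turns on whether the deleted URLs are genuinely fetchable by the crawler, a quick check from outside ManifoldCF is to request them without a browser session. This is a minimal sketch, assuming Python 3; the URLs are placeholders for the real identifiers from the job's history, and a 200 response that ends on an SSO login page is, from the crawler's point of view, as unreachable as a 404. Two further sketches after the quoted thread cover link reachability from the seed page and a sanity check against the Elasticsearch index.

```python
# Minimal sketch: check how a set of "deleted" URLs responds when fetched
# anonymously, the way the crawler sees them. The URLs are placeholders;
# substitute the identifiers from the Document Status / Simple History reports.
import urllib.request
import urllib.error

urls = [
    "https://intranet.example.com/page/117047",   # hypothetical identifiers
    "https://intranet.example.com/page/119200",
]

for url in urls:
    req = urllib.request.Request(url, method="HEAD")
    try:
        # Redirects are followed automatically; resp.geturl() shows the final
        # URL, so a hop to an /sso/ login page is visible even when status is 200.
        with urllib.request.urlopen(req, timeout=30) as resp:
            print(url, "->", resp.status, resp.geturl())
    except urllib.error.HTTPError as e:
        print(url, "->", e.code)          # a 404 here would explain a deletion
    except urllib.error.URLError as e:
        print(url, "->", "unreachable:", e.reason)
```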
On Sat, Nov 2, 2019 at 12:41 AM Priya Arora <pr...@smartshore.nl> wrote:

> No, I am not deleting the job after it runs; its status is updated to "Done" after all processing.
> The process involves indexing the documents, and the deletion process executes just before the job ends.
> The sequence is: fetch, indexing, extraction and other processes, deletion, then job done.
>
> Sent from my iPhone
>
> On 01-Nov-2019, at 8:42 PM, Karl Wright <daddy...@gmail.com> wrote:
>
> So Priya, one thing is not clear to me: are you *deleting* the job after it runs?
> Because if you are, all documents indexed by that job will be deleted as well.
> You need to leave the job around and not delete it unless you want the documents that the job indexed to go away.
>
> Karl
>
> On Fri, Nov 1, 2019 at 6:51 AM Karl Wright <daddy...@gmail.com> wrote:
>
>> There is a "Hop filters" tab in the job. This allows you to specify the maximum number of hops from the seed documents that are allowed. Or you can turn it off entirely, if you do not want this feature.
>>
>> Bear in mind that documents that are unreachable by *any* means from the seed documents will always be deleted at the end of each job run. So if you are relying on some special page you generate to point at all the documents you want to crawl, make sure it has a complete list. If you try to make an incremental list of just the new documents, then all the old ones will get removed.
>>
>> Karl
>>
>> On Fri, Nov 1, 2019 at 6:41 AM Priya Arora <pr...@smartshore.nl> wrote:
>>
>>> Yes, I have set up authentication properly; we have configured this setting by passing the required info in the header.
>>>
>>> "(1) They are now unreachable, whereas they were reachable before by the specified number of hops from the seed documents" - But if I compare with the previous index, where the data is not much older (about a week), the now-deleted documents were ingested, and when I check them they do not return 404.
>>> Regarding "the specified number of hops from the seed documents": can you please elaborate on this a little?
>>>
>>> Thanks
>>> Priya
>>>
>>> On Fri, Nov 1, 2019 at 3:43 PM Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Hi Priya,
>>>>
>>>> ManifoldCF doesn't delete documents unless:
>>>> (1) They are now unreachable, whereas they were reachable before by the specified number of hops from the seed documents;
>>>> (2) They cannot be fetched due to a 404 error, or something similar which tells ManifoldCF that they are not available.
>>>>
>>>> Your site, I notice, has an "sso" page. Are you setting up session authentication properly?
>>>>
>>>> Karl
>>>>
>>>> On Fri, Nov 1, 2019 at 3:59 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>
>>>>> <del2.JPG>
>>>>>
>>>>> Screenshot of the deleted documents other than PDFs.
>>>>>
>>>>> On Fri, Nov 1, 2019 at 1:28 PM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>
>>>>>> The job was started as per the schedule below:
>>>>>> <job.JPG>
>>>>>>
>>>>>> Just before the completion of the job, it started the deletion process. Before the job was started, a new index was created in Elasticsearch and the database was cleaned up.
>>>>>> <deletion.JPG>
>>>>>>
>>>>>> Records were processed and indexed successfully. When I check one of the deleted URLs in a browser, it is a valid URL and is accessible.
>>>>>> The job has to crawl around 2.25 lakh (about 225,000) records, so the seeded URL has many sub-links within it. If the idea is that the crawler somehow deletes URLs because they were already present in the database, that should not be the case, as the database cleanup was done before the run.
>>>>>>
>>>>>> If the idea is that the crawler deletes only documents with a PDF extension, that is also not the case, as other HTML pages are deleted as well.
>>>>>>
>>>>>> Can you please suggest something on this?
>>>>>>
>>>>>> Thanks
>>>>>> Priya
>>>>>>
>>>>>> On Wed, Oct 30, 2019 at 3:39 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>>
>>>>>>> So it looks like the URL ending in 117047 was successfully processed and indexed, and not removed. The URLs ending in 119200 and lang-en were both unreachable and were removed. I don't see a job end at all? There's a new job start at 12:39 though.
>>>>>>>
>>>>>>> What I want to see is the lifetime of one of the documents that you think is getting removed for no reason.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Oct 30, 2019 at 3:13 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>>>
>>>>>>>> You wanted to test the job's whole process - start, end, and all events (from History) - by seeding one of these URLs. Below are the results:
>>>>>>>> I changed the seed URL to the picked identifier; that document was then fetched and indexed into a new index, and the deletion process started.
>>>>>>>> <Start.JPG>
>>>>>>>>
>>>>>>>> <Indexation and Deletion.JPG>
>>>>>>>>
>>>>>>>> On Wed, Oct 30, 2019 at 12:25 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Ok, so pick ONE of these identifiers.
>>>>>>>>>
>>>>>>>>> What I want to see is the entire lifecycle of the ONE identifier. That includes what the Web Connection logs as well as what the indexation logs. Ideally I'd like to see:
>>>>>>>>>
>>>>>>>>> - job start and end
>>>>>>>>> - web connection events
>>>>>>>>> - indexing events
>>>>>>>>>
>>>>>>>>> I'd like to see these for both the job that indexes the document initially as well as the job run that deletes the document.
>>>>>>>>>
>>>>>>>>> My suspicion is that on the second run the document is simply no longer reachable from the seeds. In other words, the seed documents either cannot be fetched on the second run or they contain different stuff and there's no longer a chain of links between the seeds and the documents being deleted.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Wed, Oct 30, 2019 at 1:50 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>>>>>
>>>>>>>>>> The indexation screenshot is below.
>>>>>>>>>>
>>>>>>>>>> <image.png>
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 29, 2019 at 7:57 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I need both ingestion and deletion.
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The history is shown below; it does not indicate any error.
>>>>>>>>>>>> <12.JPG>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Priya
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What does the history say about these documents?
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> "it may be that (a) they weren't found, or (b) that the document specification in the job changed and they are no longer included in the job."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The URLs that were deleted are valid URLs (they do not result in a 404 or page-not-found error), and they are not listed in the Exclusions tab of the job configuration.
>>>>>>>>>>>>>> The URLs were being indexed earlier, and except for the index name in Elasticsearch, nothing has changed in the job specification or in the other connectors.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Priya
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ManifoldCF is an incremental crawler, which means that on every (non-continuous) job run it sees which documents it can find and removes the ones it can't. The history for the documents being deleted should tell you why they are being deleted -- it may be that (a) they weren't found, or (b) that the document specification in the job changed and they are no longer included in the job.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have a query regarding the ManifoldCF job process. I have a job to crawl an intranet site.
>>>>>>>>>>>>>>>> Repository type: Web
>>>>>>>>>>>>>>>> Output connector type: Elasticsearch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The job has to crawl around 4-5 lakh (400,000-500,000) records in total. I discarded the previous index, created a new index (in Elasticsearch) with proper mappings and settings, and started the job again after also cleaning the database (the database used is PostgreSQL).
>>>>>>>>>>>>>>>> While the job runs it ingests the records properly, but just before finishing (sometimes in between as well) it initiates the deletion process, and it does not index the deleted documents again.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you please suggest whether I am doing anything wrong? Or, if this is normal ManifoldCF behaviour, why are the documents not getting ingested again?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks and regards
>>>>>>>>>>>>>>>> Priya
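Karl's point throughout the thread is that a document is deleted when there is no longer a chain of links from the seed documents within the hop-filter limit. The sketch below approximates that check from outside ManifoldCF: a breadth-first walk from a seed page, bounded by a small hop count, reporting whether a given URL is reached. It is a minimal sketch, assuming Python 3; the seed, target, and hop limit are placeholders, and it ignores session authentication, robots.txt, and the job's inclusion/exclusion rules, so it only approximates what the crawler considers reachable.

```python
# Minimal sketch: breadth-first link walk from a seed page, bounded by a hop
# count, to see whether a "deleted" URL is still linked from the seeds.
# SEED, TARGET, and MAX_HOPS are placeholders for the job's actual settings.
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

SEED = "https://intranet.example.com/"                  # hypothetical seed document
TARGET = "https://intranet.example.com/page/119200"     # hypothetical deleted URL
MAX_HOPS = 2                                            # mirror the job's hop filter

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def links_of(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except Exception as e:
        print("fetch failed:", url, e)
        return []
    parser = LinkCollector()
    parser.feed(html)
    # Resolve relative links and drop fragments so URLs compare cleanly.
    return [urldefrag(urljoin(url, href))[0] for href in parser.links]

seen = {SEED}
queue = deque([(SEED, 0)])
while queue:
    url, hops = queue.popleft()
    if url == TARGET:
        print("reachable in", hops, "hops")
        break
    if hops >= MAX_HOPS:
        continue
    for link in links_of(url):
        if link not in seen:
            seen.add(link)
            queue.append((link, hops + 1))
else:
    print("not reachable within", MAX_HOPS, "hops of the seed")
```

If the target is not reachable here but was indexed in the earlier run, that matches Karl's suspicion that the chain of links from the seeds changed between runs.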
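Since only the Elasticsearch index name changed between runs, it can also help to confirm what actually reached each index. A minimal sketch, assuming Python 3 and an Elasticsearch node at localhost:9200; the index names are placeholders, and the field holding the document URL depends on the output connection's configuration and is an assumption here.

```python
# Minimal sketch: compare the old and new Elasticsearch indexes and look for one
# "deleted" URL. The host, index names, and URI field name are placeholders;
# adjust them to match the output connection's actual configuration.
import json
import urllib.request

ES = "http://localhost:9200"                             # hypothetical ES node
OLD_INDEX, NEW_INDEX = "intranet_old", "intranet_new"    # hypothetical index names
DELETED_URL = "https://intranet.example.com/page/119200"
URI_FIELD = "uri"    # assumption: field in the mapping that holds the document URL

def get_json(path, body=None):
    # GET when no body is given, POST with a JSON body otherwise.
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(ES + path, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())

for index in (OLD_INDEX, NEW_INDEX):
    count = get_json(f"/{index}/_count")["count"]
    # Assumption: the URL is also indexed as a keyword sub-field for exact match.
    hits = get_json(f"/{index}/_search",
                    {"query": {"term": {f"{URI_FIELD}.keyword": DELETED_URL}}})
    print(index, "doc count:", count, "deleted URL hits:", hits["hits"]["total"])
```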