When I created a new job and followed the lifecycle/execution of the identifier, it did not start the deletion process. There was no change in the job configuration or in the database setup and configuration.
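Since the question in the thread below turns on whether the deleted URLs are genuinely fetchable by the crawler, a quick check from outside ManifoldCF is to request them without a browser session. This is a minimal sketch, assuming Python 3; the URLs are placeholders for the real identifiers from the job's history, and a 200 response that ends on an SSO login page is, from the crawler's point of view, as unreachable as a 404. Two further sketches after the quoted thread cover link reachability from the seed page and a sanity check against the Elasticsearch index.

```python
# Minimal sketch: check how a set of "deleted" URLs responds when fetched
# anonymously, the way the crawler sees them. The URLs are placeholders;
# substitute the identifiers from the Document Status / Simple History reports.
import urllib.request
import urllib.error

urls = [
    "https://intranet.example.com/page/117047",   # hypothetical identifiers
    "https://intranet.example.com/page/119200",
]

for url in urls:
    req = urllib.request.Request(url, method="HEAD")
    try:
        # Redirects are followed automatically; resp.geturl() shows the final
        # URL, so a hop to an /sso/ login page is visible even when status is 200.
        with urllib.request.urlopen(req, timeout=30) as resp:
            print(url, "->", resp.status, resp.geturl())
    except urllib.error.HTTPError as e:
        print(url, "->", e.code)          # a 404 here would explain a deletion
    except urllib.error.URLError as e:
        print(url, "->", "unreachable:", e.reason)
```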
On Sat, Nov 2, 2019 at 12:41 AM Priya Arora <pr...@smartshore.nl> wrote:

> No, I am not deleting the job after it runs; its status is updated to "Done" after all processing.
> The process involves indexing the documents, and the deletion process executes just before the job ends.
> The sequence is: fetch, indexing, extraction and other processes, deletion, then job done.
>
> Sent from my iPhone
>
> On 01-Nov-2019, at 8:42 PM, Karl Wright <daddy...@gmail.com> wrote:
>
> So Priya, one thing is not clear to me: are you *deleting* the job after it runs?
> Because if you are, all documents indexed by that job will be deleted as well.
> You need to leave the job around and not delete it unless you want the documents that the job indexed to go away.
>
> Karl
>
> On Fri, Nov 1, 2019 at 6:51 AM Karl Wright <daddy...@gmail.com> wrote:
>
>> There is a "Hop filters" tab in the job. This allows you to specify the maximum number of hops from the seed documents that are allowed. Or you can turn it off entirely, if you do not want this feature.
>>
>> Bear in mind that documents that are unreachable by *any* means from the seed documents will always be deleted at the end of each job run. So if you are relying on some special page you generate to point at all the documents you want to crawl, make sure it has a complete list. If you try to make an incremental list of just the new documents, then all the old ones will get removed.
>>
>> Karl
>>
>> On Fri, Nov 1, 2019 at 6:41 AM Priya Arora <pr...@smartshore.nl> wrote:
>>
>>> Yes, I have set up authentication properly; we have configured this setting by passing the required info in the header.
>>>
>>> "(1) They are now unreachable, whereas they were reachable before by the specified number of hops from the seed documents" - But if I compare with the previous index, where the data is not much older (about a week), the now-deleted documents were ingested, and when I check them they do not return 404.
>>> Regarding "the specified number of hops from the seed documents": can you please elaborate on this a little?
>>>
>>> Thanks
>>> Priya
>>>
>>> On Fri, Nov 1, 2019 at 3:43 PM Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Hi Priya,
>>>>
>>>> ManifoldCF doesn't delete documents unless:
>>>> (1) They are now unreachable, whereas they were reachable before by the specified number of hops from the seed documents;
>>>> (2) They cannot be fetched due to a 404 error, or something similar which tells ManifoldCF that they are not available.
>>>>
>>>> Your site, I notice, has an "sso" page. Are you setting up session authentication properly?
>>>>
>>>> Karl
>>>>
>>>> On Fri, Nov 1, 2019 at 3:59 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>
>>>>> <del2.JPG>
>>>>>
>>>>> Screenshot of the deleted documents other than PDFs.
>>>>>
>>>>> On Fri, Nov 1, 2019 at 1:28 PM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>
>>>>>> The job was started as per the schedule below:
>>>>>> <job.JPG>
>>>>>>
>>>>>> Just before the completion of the job, it started the deletion process. Before the job was started, a new index was created in Elasticsearch and the database was cleaned up.
>>>>>> <deletion.JPG>
>>>>>>
>>>>>> Records were processed and indexed successfully. When I check one of the deleted URLs in a browser, it is a valid URL and is accessible.
>>>>>> The job has to crawl around 2.25 lakh (about 225,000) records, so the seeded URL has many sub-links within it. If the idea is that the crawler somehow deletes URLs because they were already present in the database, that should not be the case, as the database cleanup was done before the run.
>>>>>>
>>>>>> If the idea is that the crawler deletes only documents with a PDF extension, that is also not the case, as other HTML pages are deleted as well.
>>>>>>
>>>>>> Can you please suggest something on this?
>>>>>>
>>>>>> Thanks
>>>>>> Priya
>>>>>>
>>>>>> On Wed, Oct 30, 2019 at 3:39 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>>
>>>>>>> So it looks like the URL ending in 117047 was successfully processed and indexed, and not removed. The URLs ending in 119200 and lang-en were both unreachable and were removed. I don't see a job end at all? There's a new job start at 12:39 though.
>>>>>>>
>>>>>>> What I want to see is the lifetime of one of the documents that you think is getting removed for no reason.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Oct 30, 2019 at 3:13 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>>>
>>>>>>>> You wanted to test the job's whole process - start, end, and all events (from History) - by seeding one of these URLs. Below are the results:
>>>>>>>> I changed the seed URL to the picked identifier; that document was then fetched and indexed into a new index, and the deletion process started.
>>>>>>>> <Start.JPG>
>>>>>>>>
>>>>>>>> <Indexation and Deletion.JPG>
>>>>>>>>
>>>>>>>> On Wed, Oct 30, 2019 at 12:25 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Ok, so pick ONE of these identifiers.
>>>>>>>>>
>>>>>>>>> What I want to see is the entire lifecycle of the ONE identifier. That includes what the Web Connection logs as well as what the indexation logs. Ideally I'd like to see:
>>>>>>>>>
>>>>>>>>> - job start and end
>>>>>>>>> - web connection events
>>>>>>>>> - indexing events
>>>>>>>>>
>>>>>>>>> I'd like to see these for both the job that indexes the document initially as well as the job run that deletes the document.
>>>>>>>>>
>>>>>>>>> My suspicion is that on the second run the document is simply no longer reachable from the seeds. In other words, the seed documents either cannot be fetched on the second run or they contain different stuff and there's no longer a chain of links between the seeds and the documents being deleted.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Wed, Oct 30, 2019 at 1:50 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>>>>>
>>>>>>>>>> The indexation screenshot is below.
>>>>>>>>>>
>>>>>>>>>> <image.png>
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 29, 2019 at 7:57 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I need both ingestion and deletion.
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The history is shown below; it does not indicate any error.
>>>>>>>>>>>> <12.JPG>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Priya
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What does the history say about these documents?
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> "it may be that (a) they weren't found, or (b) that the document specification in the job changed and they are no longer included in the job."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The URLs that were deleted are valid URLs (they do not result in a 404 or page-not-found error), and they are not listed in the Exclusions tab of the job configuration.
>>>>>>>>>>>>>> The URLs were being indexed earlier, and except for the index name in Elasticsearch, nothing has changed in the job specification or in the other connectors.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Priya
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ManifoldCF is an incremental crawler, which means that on every (non-continuous) job run it sees which documents it can find and removes the ones it can't. The history for the documents being deleted should tell you why they are being deleted -- it may be that (a) they weren't found, or (b) that the document specification in the job changed and they are no longer included in the job.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have a query regarding the ManifoldCF job process. I have a job to crawl an intranet site.
>>>>>>>>>>>>>>>> Repository type: Web
>>>>>>>>>>>>>>>> Output connector type: Elasticsearch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The job has to crawl around 4-5 lakh (400,000-500,000) records in total. I discarded the previous index, created a new index (in Elasticsearch) with proper mappings and settings, and started the job again after also cleaning the database (the database used is PostgreSQL).
>>>>>>>>>>>>>>>> While the job runs it ingests the records properly, but just before finishing (sometimes in between as well) it initiates the deletion process, and it does not index the deleted documents again.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you please suggest whether I am doing anything wrong? Or, if this is normal ManifoldCF behaviour, why are the documents not getting ingested again?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks and regards
>>>>>>>>>>>>>>>> Priya
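Karl's point throughout the thread is that a document is deleted when there is no longer a chain of links from the seed documents within the hop-filter limit. The sketch below approximates that check from outside ManifoldCF: a breadth-first walk from a seed page, bounded by a small hop count, reporting whether a given URL is reached. It is a minimal sketch, assuming Python 3; the seed, target, and hop limit are placeholders, and it ignores session authentication, robots.txt, and the job's inclusion/exclusion rules, so it only approximates what the crawler considers reachable.

```python
# Minimal sketch: breadth-first link walk from a seed page, bounded by a hop
# count, to see whether a "deleted" URL is still linked from the seeds.
# SEED, TARGET, and MAX_HOPS are placeholders for the job's actual settings.
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

SEED = "https://intranet.example.com/"                  # hypothetical seed document
TARGET = "https://intranet.example.com/page/119200"     # hypothetical deleted URL
MAX_HOPS = 2                                            # mirror the job's hop filter

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def links_of(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except Exception as e:
        print("fetch failed:", url, e)
        return []
    parser = LinkCollector()
    parser.feed(html)
    # Resolve relative links and drop fragments so URLs compare cleanly.
    return [urldefrag(urljoin(url, href))[0] for href in parser.links]

seen = {SEED}
queue = deque([(SEED, 0)])
while queue:
    url, hops = queue.popleft()
    if url == TARGET:
        print("reachable in", hops, "hops")
        break
    if hops >= MAX_HOPS:
        continue
    for link in links_of(url):
        if link not in seen:
            seen.add(link)
            queue.append((link, hops + 1))
else:
    print("not reachable within", MAX_HOPS, "hops of the seed")
```

If the target is not reachable here but was indexed in the earlier run, that matches Karl's suspicion that the chain of links from the seeds changed between runs.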
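Since only the Elasticsearch index name changed between runs, it can also help to confirm what actually reached each index. A minimal sketch, assuming Python 3 and an Elasticsearch node at localhost:9200; the index names are placeholders, and the field holding the document URL depends on the output connection's configuration and is an assumption here.

```python
# Minimal sketch: compare the old and new Elasticsearch indexes and look for one
# "deleted" URL. The host, index names, and URI field name are placeholders;
# adjust them to match the output connection's actual configuration.
import json
import urllib.request

ES = "http://localhost:9200"                             # hypothetical ES node
OLD_INDEX, NEW_INDEX = "intranet_old", "intranet_new"    # hypothetical index names
DELETED_URL = "https://intranet.example.com/page/119200"
URI_FIELD = "uri"    # assumption: field in the mapping that holds the document URL

def get_json(path, body=None):
    # GET when no body is given, POST with a JSON body otherwise.
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(ES + path, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())

for index in (OLD_INDEX, NEW_INDEX):
    count = get_json(f"/{index}/_count")["count"]
    # Assumption: the URL is also indexed as a keyword sub-field for exact match.
    hits = get_json(f"/{index}/_search",
                    {"query": {"term": {f"{URI_FIELD}.keyword": DELETED_URL}}})
    print(index, "doc count:", count, "deleted URL hits:", hits["hits"]["total"])
```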