Thanks Karl. Compared with the three methods you suggested, I believe writing to a file would be the easiest; correct me if I am wrong.
What I initially thought was that, while the job is running, I would write counter values for each document seeded and processed, since we call the addSeedDocument() and processDocument() methods for each document. In that case it would not be easy to reconcile after the job is complete, as I have loads of data once the job finishes and mapping them would be tough. This is why I am trying to avoid a file-based mechanism. I would also hit the tracking issue, as we call the connector object multiple times and have multiple agents running in parallel. Please suggest.

Regards.

On Tue, Sep 16, 2014 at 11:59 AM, Karl Wright <daddy...@gmail.com> wrote:

> Hi Lalit,
>
> So, let me clarify: you want some independent measure of whether every
> document seeded, per job, has in fact been processed?
>
> If that is a correct statement, there is by definition no "in code" way to
> do it, since there are multiple agents running in your setup. Each agent
> may process some of the documents, and certainly no agent will process all
> of them. Also, restarting any agents process will lose the information you
> are attempting to record.
>
> So you are stuck with three possibilities:
>
> The first possibility is to use [INFO] statements written to the log.
> This would work, but you don't have the information you need in your
> connector (specifically the job ID), so you would have to add these
> logging statements in various places in the ManifoldCF framework.
>
> The second possibility is to make use of the history database table, where
> events are recorded. You could create two new activity types, also written
> within the framework, for tracking seeding of records and for tracking
> processing of records. There are already activity types for job start and
> end.
> Finally, the third possibility: if you must absolutely avoid the file
> system, you would have to write a tracking process that allowed ManifoldCF
> threads to connect via sockets and communicate document seeding and
> processing events. Once again, within the framework, you would transmit
> events to the recording process. This system would be at risk of losing
> tracking data whenever your tracking process needed to be restarted,
> however.
>
> None of these is trivial to implement. Essentially, keeping track of
> documents is what MCF uses the database for in the first place, so this
> requirement is like insisting that there be a second ManifoldCF there to
> be sure that the first one did the right thing. It's an incredible waste
> of resources, frankly. Using the log is perhaps the simplest to implement
> and most consistent with what clients might be expecting, but it has very
> significant I/O costs. Using the history table has a similar problem,
> while also putting your database under load. The last solution requires a
> lot of well-constructed code and remains vulnerable to system instability.
> Take your pick.
>
> Karl
>
> On Tue, Sep 16, 2014 at 12:54 AM, lalit jangra <lalit.j.jan...@gmail.com>
> wrote:
>
>> Greetings,
>>
>> As part of the implementation, I need to put a reconciliation mechanism
>> in place where it can be verified how many documents have been crawled
>> for a job, and the same can be displayed in the logs.
>>
>> The first thing that came to mind was to put counters in, e.g., the CMIS
>> connector code in the addSeed() and processDocuments() methods and
>> increment them as we progress, but as I could see for CMIS,
>> CmisRepositoryConnector.java is called for each seeded document to be
>> ingested, so these counters are not accurate. Is there any way I can
>> persist these counters within the code itself, as I do not want to
>> persist them in the file system?
>>
>> Please suggest.
>> --
>> Regards,
>> Lalit.

--
Regards,
Lalit.
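For anyone skimming this thread later, the bookkeeping being discussed (matching the set of seeded documents against the set of processed ones, per job) can be sketched as below. Per Karl's point, an in-memory structure like this cannot work across multiple agents processes and does not survive a restart; it is shown only to illustrate the reconciliation logic itself. The class and method names are illustrative, not part of the ManifoldCF API.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: thread-safe per-job seed/process bookkeeping.
// A single JVM is assumed; with multiple agents processes, this state
// would have to live in the history table or an external store instead.
public class SeedProcessTracker {
  private final Map<String, Set<String>> seeded = new ConcurrentHashMap<>();
  private final Map<String, Set<String>> processed = new ConcurrentHashMap<>();

  // Called where the connector seeds a document (e.g. from addSeedDocuments()).
  public void recordSeed(String jobId, String docId) {
    seeded.computeIfAbsent(jobId, k -> ConcurrentHashMap.newKeySet()).add(docId);
  }

  // Called where the connector processes a document (e.g. from processDocuments()).
  public void recordProcess(String jobId, String docId) {
    processed.computeIfAbsent(jobId, k -> ConcurrentHashMap.newKeySet()).add(docId);
  }

  // Reconciliation: documents seeded but never processed for the given job.
  public Set<String> unprocessed(String jobId) {
    Set<String> result = ConcurrentHashMap.newKeySet();
    result.addAll(seeded.getOrDefault(jobId, Set.of()));
    result.removeAll(processed.getOrDefault(jobId, Set.of()));
    return result;
  }
}
```

A durable version would push each recordSeed/recordProcess event into the history table or a separate tracking process, which is exactly what Karl's second and third options describe.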