I think I found the problem: I had also set carry-down data on the parent with the same carry-down key, "content". In that case, retrieveParentData for the childIdentifier returns both its own data and the parent's. I simply have to change the carry-down key used for the parent. This is something I will have to keep in mind!
Thanks for your help, Karl.

-----Original Message-----
From: julien.massi...@francelabs.com <julien.massi...@francelabs.com>
Sent: Monday, March 22, 2021 11:29
To: dev@manifoldcf.apache.org
Subject: RE: How to override carry down data

activities.noDocument is called on the parentIdentifier, and activities.ingestDocumentWithException is called on the child. These calls should trigger the method you mention, shouldn't they?

-----Original Message-----
From: Karl Wright <daddy...@gmail.com>
Sent: Monday, March 22, 2021 02:20
To: dev <dev@manifoldcf.apache.org>
Subject: Re: How to override carry down data

It gets called during JobManager.finishDocuments(), here:

  @Override
  public DocumentDescription[] finishDocuments(Long jobID, String[] legalLinkTypes, String[] parentIdentifierHashes, int hopcountMethod)
    throws ManifoldCFException
  ...
        // A certain set of carrydown records are going to be deleted by the ensuing restoreRecords command.  Calculate that set of records!
        rval = calculateAffectedRestoreCarrydownChildren(jobID,parentIdentifierHashes);
        carryDown.restoreRecords(jobID,parentIdentifierHashes);
        database.performCommit();
  ...

Is your connector calling the IProcessActivity methods meant to signal that document processing has finished? If not, that is the problem!

Karl

On Sun, Mar 21, 2021 at 9:14 PM Karl Wright <daddy...@gmail.com> wrote:

> Ah, so it appears that the way this works is subtle and clever.
>
> Values are added or updated in one phase of activity. At this time
> the records are flagged with either "new" or "existing". At a later
> time, values still in the "base state" are removed, and the "new" and
> "existing" states are mapped back to the base state.
>
> This is the Carrydown class method that supposedly does the deletion
> and rejiggering of the states:
>
>   /** Return all records belonging to the specified parent documents to the base state,
>   * and delete the old (eliminated) child records.
>   */
>   public void restoreRecords(Long jobID, String[] parentDocumentIDHashes)
>     throws ManifoldCFException
>
> ... and it appears that it does the right thing:
>
>       // Delete
>       StringBuilder sb = new StringBuilder("WHERE ");
>       ArrayList newList = new ArrayList();
>
>       sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
>         new UnitaryClause(jobIDField,jobID),
>         new MultiClause(parentIDHashField,list)})).append(" AND ");
>
>       sb.append(newField).append("=?");
>       newList.add(statusToString(ISNEW_BASE));
>       performDelete(sb.toString(),newList,null);
>
>       // Restore new values
>       sb = new StringBuilder("WHERE ");
>       newList.clear();
>
>       sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
>         new UnitaryClause(jobIDField,jobID),
>         new MultiClause(parentIDHashField,list)})).append(" AND ");
>
>       sb.append(newField).append(" IN (?,?)");
>       newList.add(statusToString(ISNEW_EXISTING));
>       newList.add(statusToString(ISNEW_NEW));
>
>       HashMap map = new HashMap();
>       map.put(newField,statusToString(ISNEW_BASE));
>       map.put(processIDField,null);
>       performUpdate(map,sb.toString(),newList,null);
>
>       noteModifications(0,list.size(),0);
>
> So the question becomes: does it get called appropriately?
>
> Karl
>
> On Sun, Mar 21, 2021 at 8:45 PM Karl Wright <daddy...@gmail.com> wrote:
>
>> I've tried to refresh my memory by looking at the carrydown code,
>> which is quite old at this point. But one thing is very clear: that
>> code never removes carrydown data values unless the child or parent
>> document goes away, and wasn't intended to.
>>
>> It's not at all trivial to do but the code here could be modified to
>> set the carrydown values to exactly what is specified in the
>> reference for the given parent. However, I worry that changing this
>> behavior will break something.
>> Carrydown has a built-in assumption
>> that if the reference is added multiple times with different data
>> during a crawl, eventually the data will stabilize and no more
>> downstream processing will be necessary.
>> Carrydown changes that are incautious will result in jobs that never
>> complete.
>>
>> I think it is worth looking at changing the behavior such that no
>> accumulation of values takes place, though. It's not an easy change,
>> I fear. I'll look into how to make it happen.
>>
>> Karl
>>
>> On Sun, Mar 21, 2021 at 1:18 PM <julien.massi...@francelabs.com> wrote:
>>
>>> ---------------------------- First crawl ----------------------------
>>>
>>> In the processDocument method the following code is triggered on the
>>> parentIdentifier:
>>>
>>> activities.addDocumentReference(childIdentifier, parentIdentifier,
>>>     null, new String[] { "content" }, new String[][] { { "someContent" } });
>>>
>>> Then the childIdentifier is processed and the following code is
>>> triggered in the processDocument method:
>>>
>>> final String[] contentArray =
>>>     activities.retrieveParentData(childIdentifier, "content");
>>>
>>> At this point, the childIdentifier correctly retrieves a contentArray
>>> containing 1 value, which is "someContent".
>>>
>>> ---------------------------- Second crawl ----------------------------
>>>
>>> In the processDocument method the following code is triggered on the
>>> parentIdentifier:
>>>
>>> activities.addDocumentReference(childIdentifier, parentIdentifier,
>>>     null, new String[] { "content" }, new String[][] { { "newContent" } });
>>>
>>> Then the childIdentifier is processed and the following code is
>>> triggered in the processDocument method:
>>>
>>> final String[] contentArray =
>>>     activities.retrieveParentData(childIdentifier, "content");
>>>
>>> At this point, the childIdentifier retrieves a contentArray
>>> containing 2 values: the old one, "someContent", and the new one, "newContent".
>>>
>>> I can guarantee that the parentIdentifier is the same between the two
>>> crawls and that, on the second crawl, only "newContent" is
>>> added. I debugged the code to confirm everything.
>>>
>>> Julien
>>>
>>> -----Original Message-----
>>> From: Karl Wright <daddy...@gmail.com>
>>> Sent: Sunday, March 21, 2021 16:05
>>> To: dev <dev@manifoldcf.apache.org>
>>> Subject: Re: How to override carry down data
>>>
>>> Can you give me a code example?
>>> The carry-down information is set by the parent, as you say. The
>>> specific information is keyed to the parent so when the child is
>>> added to the queue, all old carrydown information from the same
>>> parent is deleted at that time, and until that happens the carrydown
>>> information is preserved for every child. As you say, it can be
>>> augmented by other parents that refer to the same child, but it is
>>> never *replaced* by carrydown info from a different parent, just augmented.
>>>
>>> If it didn't work this way, MCF would have horrendous order
>>> dependencies in what documents got processed first. As it is, when
>>> the carrydown information changes because another parent is
>>> discovered, the children are queued for processing to achieve stable
>>> results.
>>>
>>> Karl
>>>
>>> On Sun, Mar 21, 2021 at 10:45 AM <julien.massi...@francelabs.com> wrote:
>>>
>>> > Hi Karl,
>>> >
>>> > I am using carry-down data in a repository connector, but I have
>>> > found that I am unable to update/override a value that has
>>> > already been set.
>>> > Indeed, even though I am using the same key and the same parent
>>> > identifier, the values are stacked. So, when I retrieve carry-down
>>> > data through the key, I get more and more values in the array
>>> > instead of only one that is updated.
>>> > It seems I misunderstood the documentation: I believed that
>>> > the carry-down data values are stacked only if there are several
>>> > parent identifiers for the same key.
>>> > What can I do to maintain only one carry-down data value for a
>>> > given key and a given parent identifier?
>>> >
>>> > Regards,
>>> >
>>> > Julien
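
To recap the mechanism discussed in this thread, here is a minimal in-memory sketch of the "base / new / existing" bookkeeping Karl describes, plus the same-key accumulation Julien ran into. It is a toy model, not ManifoldCF code: the class CarrydownSketch, the State enum, the Rec type and the simplified method signatures are all hypothetical, and the real implementation works on database rows keyed by hashes. It assumes only what the thread states: records touched during processing are flagged, untouched "base" records for a finished parent are deleted, and a child sees the values contributed by every referrer under a given key.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

/** Toy in-memory model of carrydown records. Hypothetical; NOT ManifoldCF code. */
final class CarrydownSketch {
    // Hypothetical stand-ins for ISNEW_BASE / ISNEW_NEW / ISNEW_EXISTING.
    enum State { BASE, NEW, EXISTING }

    static final class Rec {
        final String parent, child, key, value;
        State state;
        Rec(String parent, String child, String key, String value, State state) {
            this.parent = parent; this.child = child;
            this.key = key; this.value = value; this.state = state;
        }
    }

    private final List<Rec> records = new ArrayList<>();

    /** A parent references a child and attaches one carrydown value under a key. */
    public void addReference(String parent, String child, String key, String value) {
        for (Rec r : records) {
            if (r.parent.equals(parent) && r.child.equals(child)
                    && r.key.equals(key) && r.value.equals(value)) {
                // Value seen again in this pass: flag it so it survives the restore.
                if (r.state == State.BASE) r.state = State.EXISTING;
                return;
            }
        }
        records.add(new Rec(parent, child, key, value, State.NEW));
    }

    /** When a parent finishes: drop records still in BASE, reset the others to BASE. */
    public void restoreRecords(String parent) {
        Iterator<Rec> it = records.iterator();
        while (it.hasNext()) {
            Rec r = it.next();
            if (!r.parent.equals(parent)) continue;
            if (r.state == State.BASE) it.remove();
            else r.state = State.BASE;
        }
    }

    /** Child-side view: all values for a key, across every referrer, sorted. */
    public List<String> retrieveParentData(String child, String key) {
        TreeSet<String> values = new TreeSet<>();
        for (Rec r : records) {
            if (r.child.equals(child) && r.key.equals(key)) values.add(r.value);
        }
        return new ArrayList<>(values);
    }

    public static void main(String[] args) {
        CarrydownSketch store = new CarrydownSketch();

        // First crawl: the parent attaches "someContent" and then finishes.
        store.addReference("parent", "child", "content", "someContent");
        store.restoreRecords("parent");
        System.out.println(store.retrieveParentData("child", "content"));

        // Second crawl: only "newContent" is attached; the stale value is
        // still in BASE when the parent finishes, so the restore deletes it.
        store.addReference("parent", "child", "content", "newContent");
        store.restoreRecords("parent");
        System.out.println(store.retrieveParentData("child", "content"));

        // A second referrer reusing the key "content" augments the child's data.
        store.addReference("otherParent", "child", "content", "otherContent");
        store.restoreRecords("otherParent");
        System.out.println(store.retrieveParentData("child", "content"));
    }
}
```

In this model the last call returns two values because two referrers share the key "content"; giving each referrer its own key, as Julien concludes at the top of the thread, keeps a single value per key.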