It gets called during JobManager.finishDocuments(), here:

  @Override
  public DocumentDescription[] finishDocuments(Long jobID, String[] legalLinkTypes, String[] parentIdentifierHashes, int hopcountMethod)
    throws ManifoldCFException
  ...
        // A certain set of carrydown records are going to be deleted by the
        // ensuing restoreRecords command.  Calculate that set of records!
        rval = calculateAffectedRestoreCarrydownChildren(jobID,parentIdentifierHashes);
        carryDown.restoreRecords(jobID,parentIdentifierHashes);
        database.performCommit();
  ...
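[Editor's note] The two-phase mechanism Karl describes further down this thread (values flagged "new" or "existing" while parents are processed, then restoreRecords deleting leftover "base"-state records and mapping the rest back to base) can be sketched as a minimal in-memory simulation. The class and method names below are illustrative only; this is not the actual org.apache.manifoldcf Carrydown implementation.

```java
import java.util.*;

// Minimal, hypothetical simulation of the carrydown state machine
// described in this thread.  NOT the real ManifoldCF Carrydown class.
public class CarrydownSim {
  enum State { BASE, NEW, EXISTING }

  // Key "childId|value" -> record state.
  private final Map<String, State> records = new HashMap<>();

  /** Phase 1: a parent asserts a value for a child.  A brand-new record is
   *  flagged NEW; a record already present is re-flagged EXISTING so the
   *  restore phase knows it was re-asserted this pass. */
  public void addRecord(String child, String value) {
    String k = child + "|" + value;
    records.put(k, records.containsKey(k) ? State.EXISTING : State.NEW);
  }

  /** Phase 2 (the restoreRecords analogue): records still in BASE were not
   *  re-asserted, so they are deleted; NEW/EXISTING map back to BASE. */
  public void restoreRecords() {
    records.values().removeIf(s -> s == State.BASE);
    records.replaceAll((k, s) -> State.BASE);
  }

  /** All carrydown values currently recorded for a child. */
  public Set<String> valuesFor(String child) {
    Set<String> out = new TreeSet<>();
    for (String k : records.keySet())
      if (k.startsWith(child + "|"))
        out.add(k.substring(child.length() + 1));
    return out;
  }

  public static void main(String[] args) {
    CarrydownSim sim = new CarrydownSim();
    // First crawl: parent asserts "someContent" for the child.
    sim.addRecord("child1", "someContent");
    sim.restoreRecords();
    // Second crawl: parent asserts only "newContent"; the stale
    // "someContent" stays in BASE and is removed by the restore phase.
    sim.addRecord("child1", "newContent");
    sim.restoreRecords();
    System.out.println(sim.valuesFor("child1")); // [newContent]
  }
}
```

If this is the intended semantics, the accumulation Julien observes would point to restoreRecords not being reached for that parent between crawls, which is the question Karl raises below.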
Is your connector calling the IProcessActivity methods meant to signal that document processing has finished?  If not, that is the problem!

Karl

On Sun, Mar 21, 2021 at 9:14 PM Karl Wright <daddy...@gmail.com> wrote:

> Ah, so it appears that the way this works is subtle and clever.
>
> Values are added or updated in one phase of activity.  At this time the
> records are flagged with either "new" or "existing".  At a later time,
> values still in the "base state" are removed, and the "new" and "existing"
> states are mapped back to the base state.
>
> This is the Carrydown class method that supposedly does the deletion and
> rejiggering of the states:
>
>   /** Return all records belonging to the specified parent documents to the base state,
>    * and delete the old (eliminated) child records.
>    */
>   public void restoreRecords(Long jobID, String[] parentDocumentIDHashes)
>     throws ManifoldCFException
>
> ... and it appears that it does the right thing:
>
>   // Delete
>   StringBuilder sb = new StringBuilder("WHERE ");
>   ArrayList newList = new ArrayList();
>
>   sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
>     new UnitaryClause(jobIDField,jobID),
>     new MultiClause(parentIDHashField,list)})).append(" AND ");
>
>   sb.append(newField).append("=?");
>   newList.add(statusToString(ISNEW_BASE));
>   performDelete(sb.toString(),newList,null);
>
>   // Restore new values
>   sb = new StringBuilder("WHERE ");
>   newList.clear();
>
>   sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
>     new UnitaryClause(jobIDField,jobID),
>     new MultiClause(parentIDHashField,list)})).append(" AND ");
>
>   sb.append(newField).append(" IN (?,?)");
>   newList.add(statusToString(ISNEW_EXISTING));
>   newList.add(statusToString(ISNEW_NEW));
>
>   HashMap map = new HashMap();
>   map.put(newField,statusToString(ISNEW_BASE));
>   map.put(processIDField,null);
>   performUpdate(map,sb.toString(),newList,null);
>
>   noteModifications(0,list.size(),0);
>
> So the question becomes: does it get called
> appropriately?
>
> Karl
>
>
> On Sun, Mar 21, 2021 at 8:45 PM Karl Wright <daddy...@gmail.com> wrote:
>
>> I've tried to refresh my memory by looking at the carrydown code, which
>> is quite old at this point.  But one thing is very clear: that code never
>> removes carrydown data values unless the child or parent document goes
>> away, and wasn't intended to.
>>
>> It's not at all trivial to do, but the code here could be modified to set
>> the carrydown values to exactly what is specified in the reference for the
>> given parent.  However, I worry that changing this behavior will break
>> something.  Carrydown has a built-in assumption that if the reference is
>> added multiple times with different data during a crawl, eventually the
>> data will stabilize and no more downstream processing will be necessary.
>> Carrydown changes that are incautious will result in jobs that never
>> complete.
>>
>> I think it is worth looking at changing the behavior such that no
>> accumulation of values takes place, though.  It's not an easy change, I
>> fear.  I'll look into how to make it happen.
>> Karl
>>
>>
>> On Sun, Mar 21, 2021 at 1:18 PM <julien.massi...@francelabs.com> wrote:
>>
>>> ---------------------------- First crawl -----------------------------------------
>>>
>>> In the processDocument method, the following code is triggered on the
>>> parentIdentifier:
>>>
>>>   activities.addDocumentReference(childIdentifier, parentIdentifier, null,
>>>     new String[] { "content" }, new String[][] { { "someContent" } });
>>>
>>> Then the childIdentifier is processed and the following code is
>>> triggered in the processDocument method:
>>>
>>>   final String[] contentArray =
>>>     activities.retrieveParentData(childIdentifier, "content");
>>>
>>> At this point, the childIdentifier correctly retrieves a contentArray
>>> containing one value, which is "someContent".
>>>
>>> ---------------------------- Second crawl -----------------------------------------
>>>
>>> In the processDocument method, the following code is triggered on the
>>> parentIdentifier:
>>>
>>>   activities.addDocumentReference(childIdentifier, parentIdentifier, null,
>>>     new String[] { "content" }, new String[][] { { "newContent" } });
>>>
>>> Then the childIdentifier is processed and the following code is
>>> triggered in the processDocument method:
>>>
>>>   final String[] contentArray =
>>>     activities.retrieveParentData(childIdentifier, "content");
>>>
>>> At this point, the childIdentifier retrieves a contentArray containing two
>>> values: the old one, "someContent", and the new one, "newContent".
>>>
>>> I can guarantee that the parentIdentifier is the same between the two
>>> crawls and that on the second crawl only "newContent" is added; I
>>> debugged the code to confirm everything.
>>>
>>> Julien
>>>
>>> -----Original message-----
>>> From: Karl Wright <daddy...@gmail.com>
>>> Sent: Sunday, March 21, 2021 4:05 PM
>>> To: dev <dev@manifoldcf.apache.org>
>>> Subject: Re: How to override carry down data
>>>
>>> Can you give me a code example?
>>> The carry-down information is set by the parent, as you say.  The
>>> specific information is keyed to the parent, so when the child is added to
>>> the queue, all old carrydown information from the same parent is deleted at
>>> that time, and until that happens the carrydown information is preserved
>>> for every child.  As you say, it can be augmented by other parents that
>>> refer to the same child, but it is never *replaced* by carrydown info from
>>> a different parent, just augmented.
>>>
>>> If it didn't work this way, MCF would have horrendous order dependencies
>>> in what documents got processed first.  As it is, when the carrydown
>>> information changes because another parent is discovered, the children are
>>> queued for processing to achieve stable results.
>>>
>>> Karl
>>>
>>>
>>> On Sun, Mar 21, 2021 at 10:45 AM <julien.massi...@francelabs.com> wrote:
>>>
>>> > Hi Karl,
>>> >
>>> > I am using carry-down data in a repository connector, but I have
>>> > figured out that I am unable to update/override a value that has
>>> > already been set.  Indeed, despite using the same key and the same
>>> > parent identifier, the values are stacked.  So, when I retrieve
>>> > carry-down data through the key, I get more and more values in the
>>> > array instead of only one that is updated.  It seems I misunderstood
>>> > the documentation; I believed that carry-down data values are stacked
>>> > only if there are several parent identifiers for the same key.
>>> > What can I do to maintain only one carry-down data value for a given
>>> > key and a given parent identifier?
>>> >
>>> > Regards,
>>> >
>>> > Julien
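[Editor's note] While the framework-side fix is discussed above, a connector could work around the observed accumulation by encoding an ordering hint into each carried value and keeping only the newest one on the child side. The sketch below is a hypothetical pattern under that assumption; `encode`/`pickLatest` are made-up helper names and are not part of the ManifoldCF API. The parent would pass `encode(...)` output as the carrydown value to `addDocumentReference`, and the child would run the array returned by `retrieveParentData` through `pickLatest`.

```java
// Hypothetical connector-side workaround for accumulated carrydown values,
// given (per this thread) that values for the same key are augmented rather
// than replaced.  Illustrative only; not part of the ManifoldCF API.
public class LatestValuePicker {

  /** Encode a payload as "version:payload" before handing it to
   *  addDocumentReference as carrydown data.  The version must increase
   *  monotonically across crawls (e.g. a crawl sequence number). */
  public static String encode(long version, String payload) {
    return version + ":" + payload;
  }

  /** Given the (possibly accumulated) array a child gets back from
   *  retrieveParentData, decode and return only the payload with the
   *  highest version, or null if the array is empty. */
  public static String pickLatest(String[] carried) {
    String best = null;
    long bestVersion = Long.MIN_VALUE;
    for (String s : carried) {
      int sep = s.indexOf(':');
      long v = Long.parseLong(s.substring(0, sep));
      if (v > bestVersion) {
        bestVersion = v;
        best = s.substring(sep + 1);
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Simulates what the child sees on the second crawl: both the stale
    // and the fresh value are present in the retrieved array.
    String[] carried = { encode(1, "someContent"), encode(2, "newContent") };
    System.out.println(pickLatest(carried)); // newContent
  }
}
```

This sidesteps the order-dependency concern Karl raises, because the child always resolves the same winner regardless of the order in which values were accumulated.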