It gets called during JobManager.finishDocuments(), here:

  @Override
  public DocumentDescription[] finishDocuments(Long jobID, String[] legalLinkTypes, String[] parentIdentifierHashes, int hopcountMethod)
    throws ManifoldCFException
  ...
        // A certain set of carrydown records are going to be deleted by the
        // ensuing restoreRecords command.  Calculate that set of records!
        rval = calculateAffectedRestoreCarrydownChildren(jobID,parentIdentifierHashes);
        carryDown.restoreRecords(jobID,parentIdentifierHashes);
        database.performCommit();
  ...
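[Editor's note] The two-phase mechanism Karl describes further down this thread (values flagged "new" or "existing" while parents are processed, then restoreRecords deleting leftover "base"-state records and mapping the rest back to base) can be sketched as a minimal in-memory simulation. The class and method names below are illustrative only; this is not the actual org.apache.manifoldcf Carrydown implementation.

```java
import java.util.*;

// Minimal, hypothetical simulation of the carrydown state machine
// described in this thread.  NOT the real ManifoldCF Carrydown class.
public class CarrydownSim {
  enum State { BASE, NEW, EXISTING }

  // Key "childId|value" -> record state.
  private final Map<String, State> records = new HashMap<>();

  /** Phase 1: a parent asserts a value for a child.  A brand-new record is
   *  flagged NEW; a record already present is re-flagged EXISTING so the
   *  restore phase knows it was re-asserted this pass. */
  public void addRecord(String child, String value) {
    String k = child + "|" + value;
    records.put(k, records.containsKey(k) ? State.EXISTING : State.NEW);
  }

  /** Phase 2 (the restoreRecords analogue): records still in BASE were not
   *  re-asserted, so they are deleted; NEW/EXISTING map back to BASE. */
  public void restoreRecords() {
    records.values().removeIf(s -> s == State.BASE);
    records.replaceAll((k, s) -> State.BASE);
  }

  /** All carrydown values currently recorded for a child. */
  public Set<String> valuesFor(String child) {
    Set<String> out = new TreeSet<>();
    for (String k : records.keySet())
      if (k.startsWith(child + "|"))
        out.add(k.substring(child.length() + 1));
    return out;
  }

  public static void main(String[] args) {
    CarrydownSim sim = new CarrydownSim();
    // First crawl: parent asserts "someContent" for the child.
    sim.addRecord("child1", "someContent");
    sim.restoreRecords();
    // Second crawl: parent asserts only "newContent"; the stale
    // "someContent" stays in BASE and is removed by the restore phase.
    sim.addRecord("child1", "newContent");
    sim.restoreRecords();
    System.out.println(sim.valuesFor("child1")); // [newContent]
  }
}
```

If this is the intended semantics, the accumulation Julien observes would point to restoreRecords not being reached for that parent between crawls, which is the question Karl raises below.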
Is your connector calling the IProcessActivity methods meant to signal that document processing has finished?  If not, that is the problem!

Karl

On Sun, Mar 21, 2021 at 9:14 PM Karl Wright <daddy...@gmail.com> wrote:

> Ah, so it appears that the way this works is subtle and clever.
>
> Values are added or updated in one phase of activity.  At this time the
> records are flagged with either "new" or "existing".  At a later time,
> values still in the "base state" are removed, and the "new" and "existing"
> states are mapped back to the base state.
>
> This is the Carrydown class method that supposedly does the deletion and
> rejiggering of the states:
>
>   /** Return all records belonging to the specified parent documents to the base state,
>    * and delete the old (eliminated) child records.
>    */
>   public void restoreRecords(Long jobID, String[] parentDocumentIDHashes)
>     throws ManifoldCFException
>
> ... and it appears that it does the right thing:
>
>   // Delete
>   StringBuilder sb = new StringBuilder("WHERE ");
>   ArrayList newList = new ArrayList();
>
>   sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
>     new UnitaryClause(jobIDField,jobID),
>     new MultiClause(parentIDHashField,list)})).append(" AND ");
>
>   sb.append(newField).append("=?");
>   newList.add(statusToString(ISNEW_BASE));
>   performDelete(sb.toString(),newList,null);
>
>   // Restore new values
>   sb = new StringBuilder("WHERE ");
>   newList.clear();
>
>   sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
>     new UnitaryClause(jobIDField,jobID),
>     new MultiClause(parentIDHashField,list)})).append(" AND ");
>
>   sb.append(newField).append(" IN (?,?)");
>   newList.add(statusToString(ISNEW_EXISTING));
>   newList.add(statusToString(ISNEW_NEW));
>
>   HashMap map = new HashMap();
>   map.put(newField,statusToString(ISNEW_BASE));
>   map.put(processIDField,null);
>   performUpdate(map,sb.toString(),newList,null);
>
>   noteModifications(0,list.size(),0);
>
> So the question becomes: does it get called
> appropriately?
>
> Karl
>
>
> On Sun, Mar 21, 2021 at 8:45 PM Karl Wright <daddy...@gmail.com> wrote:
>
>> I've tried to refresh my memory by looking at the carrydown code, which
>> is quite old at this point.  But one thing is very clear: that code never
>> removes carrydown data values unless the child or parent document goes
>> away, and wasn't intended to.
>>
>> It's not at all trivial to do, but the code here could be modified to set
>> the carrydown values to exactly what is specified in the reference for the
>> given parent.  However, I worry that changing this behavior will break
>> something.  Carrydown has a built-in assumption that if the reference is
>> added multiple times with different data during a crawl, eventually the
>> data will stabilize and no more downstream processing will be necessary.
>> Carrydown changes that are incautious will result in jobs that never
>> complete.
>>
>> I think it is worth looking at changing the behavior such that no
>> accumulation of values takes place, though.  It's not an easy change, I
>> fear.  I'll look into how to make it happen.
>> Karl
>>
>>
>> On Sun, Mar 21, 2021 at 1:18 PM <julien.massi...@francelabs.com> wrote:
>>
>>> ---------------------------- First crawl -----------------------------------------
>>>
>>> In the processDocument method, the following code is triggered on the
>>> parentIdentifier:
>>>
>>>   activities.addDocumentReference(childIdentifier, parentIdentifier, null,
>>>     new String[] { "content" }, new String[][] { { "someContent" } });
>>>
>>> Then the childIdentifier is processed and the following code is
>>> triggered in the processDocument method:
>>>
>>>   final String[] contentArray =
>>>     activities.retrieveParentData(childIdentifier, "content");
>>>
>>> At this point, the childIdentifier correctly retrieves a contentArray
>>> containing one value, which is "someContent".
>>>
>>> ---------------------------- Second crawl -----------------------------------------
>>>
>>> In the processDocument method, the following code is triggered on the
>>> parentIdentifier:
>>>
>>>   activities.addDocumentReference(childIdentifier, parentIdentifier, null,
>>>     new String[] { "content" }, new String[][] { { "newContent" } });
>>>
>>> Then the childIdentifier is processed and the following code is
>>> triggered in the processDocument method:
>>>
>>>   final String[] contentArray =
>>>     activities.retrieveParentData(childIdentifier, "content");
>>>
>>> At this point, the childIdentifier retrieves a contentArray containing two
>>> values: the old one, "someContent", and the new one, "newContent".
>>>
>>> I can guarantee that the parentIdentifier is the same between the two
>>> crawls and that on the second crawl only "newContent" is added; I
>>> debugged the code to confirm everything.
>>>
>>> Julien
>>>
>>> -----Original message-----
>>> From: Karl Wright <daddy...@gmail.com>
>>> Sent: Sunday, March 21, 2021 4:05 PM
>>> To: dev <dev@manifoldcf.apache.org>
>>> Subject: Re: How to override carry down data
>>>
>>> Can you give me a code example?
>>> The carry-down information is set by the parent, as you say.  The
>>> specific information is keyed to the parent, so when the child is added to
>>> the queue, all old carrydown information from the same parent is deleted at
>>> that time, and until that happens the carrydown information is preserved
>>> for every child.  As you say, it can be augmented by other parents that
>>> refer to the same child, but it is never *replaced* by carrydown info from
>>> a different parent, just augmented.
>>>
>>> If it didn't work this way, MCF would have horrendous order dependencies
>>> in what documents got processed first.  As it is, when the carrydown
>>> information changes because another parent is discovered, the children are
>>> queued for processing to achieve stable results.
>>>
>>> Karl
>>>
>>>
>>> On Sun, Mar 21, 2021 at 10:45 AM <julien.massi...@francelabs.com> wrote:
>>>
>>> > Hi Karl,
>>> >
>>> > I am using carry-down data in a repository connector, but I have
>>> > figured out that I am unable to update/override a value that has
>>> > already been set.  Indeed, despite using the same key and the same
>>> > parent identifier, the values are stacked.  So, when I retrieve
>>> > carry-down data through the key, I get more and more values in the
>>> > array instead of only one that is updated.  It seems I misunderstood
>>> > the documentation; I believed that carry-down data values are stacked
>>> > only if there are several parent identifiers for the same key.
>>> > What can I do to maintain only one carry-down data value for a given
>>> > key and a given parent identifier?
>>> >
>>> > Regards,
>>> >
>>> > Julien
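[Editor's note] While the framework-side fix is discussed above, a connector could work around the observed accumulation by encoding an ordering hint into each carried value and keeping only the newest one on the child side. The sketch below is a hypothetical pattern under that assumption; `encode`/`pickLatest` are made-up helper names and are not part of the ManifoldCF API. The parent would pass `encode(...)` output as the carrydown value to `addDocumentReference`, and the child would run the array returned by `retrieveParentData` through `pickLatest`.

```java
// Hypothetical connector-side workaround for accumulated carrydown values,
// given (per this thread) that values for the same key are augmented rather
// than replaced.  Illustrative only; not part of the ManifoldCF API.
public class LatestValuePicker {

  /** Encode a payload as "version:payload" before handing it to
   *  addDocumentReference as carrydown data.  The version must increase
   *  monotonically across crawls (e.g. a crawl sequence number). */
  public static String encode(long version, String payload) {
    return version + ":" + payload;
  }

  /** Given the (possibly accumulated) array a child gets back from
   *  retrieveParentData, decode and return only the payload with the
   *  highest version, or null if the array is empty. */
  public static String pickLatest(String[] carried) {
    String best = null;
    long bestVersion = Long.MIN_VALUE;
    for (String s : carried) {
      int sep = s.indexOf(':');
      long v = Long.parseLong(s.substring(0, sep));
      if (v > bestVersion) {
        bestVersion = v;
        best = s.substring(sep + 1);
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Simulates what the child sees on the second crawl: both the stale
    // and the fresh value are present in the retrieved array.
    String[] carried = { encode(1, "someContent"), encode(2, "newContent") };
    System.out.println(pickLatest(carried)); // newContent
  }
}
```

This sidesteps the order-dependency concern Karl raises, because the child always resolves the same winner regardless of the order in which values were accumulated.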