Ah, so it appears that the way this works is subtle and clever.

Values are added or updated in one phase of activity.  At that time the
records are flagged as either "new" (ISNEW_NEW) or "existing"
(ISNEW_EXISTING).  At a later time, records still in the base state
(ISNEW_BASE) are removed, and the "new" and "existing" states are mapped
back to the base state.
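
To make the two phases concrete, here is a minimal sketch, assuming a
simplified in-memory model of a single parent/child/key combination.  The
class CarrydownSketch and its methods are hypothetical and are not part of
ManifoldCF; only the three state names mirror the ISNEW_BASE, ISNEW_NEW and
ISNEW_EXISTING constants used by the Carrydown class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the three record states; not ManifoldCF code. */
public class CarrydownSketch {
  enum State { BASE, NEW, EXISTING }

  // value -> state, for one (job, parent, child, key) combination
  private final Map<String, State> records = new LinkedHashMap<>();

  /** Phase 1: a value seen during this pass is flagged NEW or EXISTING. */
  public void addOrUpdate(String value) {
    records.put(value, records.containsKey(value) ? State.EXISTING : State.NEW);
  }

  /** Phase 2: drop records still in BASE, then map NEW/EXISTING back to BASE. */
  public void restore() {
    records.values().removeIf(s -> s == State.BASE);
    records.replaceAll((value, s) -> State.BASE);
  }

  /** Values currently visible to the child, in insertion order. */
  public List<String> values() {
    return new ArrayList<>(records.keySet());
  }

  public static void main(String[] args) {
    CarrydownSketch c = new CarrydownSketch();
    c.addOrUpdate("someContent");   // first crawl
    c.restore();
    c.addOrUpdate("newContent");    // second crawl
    System.out.println("before restore: " + c.values());
    c.restore();
    System.out.println("after restore:  " + c.values());
  }
}
```

In this model the value from the first crawl is still visible until the
second restore() runs, and is gone afterwards.  If restore() were never
called for that parent, both values would remain.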

This is the Carrydown class method that supposedly does the deletion and
rejiggering of the states:

  /** Return all records belonging to the specified parent documents to the base state,
  * and delete the old (eliminated) child records.
  */
  public void restoreRecords(Long jobID, String[] parentDocumentIDHashes)
    throws ManifoldCFException

... and it appears that it does the right thing:

    // Delete
    StringBuilder sb = new StringBuilder("WHERE ");
    ArrayList newList = new ArrayList();

    sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
      new UnitaryClause(jobIDField,jobID),
      new MultiClause(parentIDHashField,list)})).append(" AND ");

    sb.append(newField).append("=?");
    newList.add(statusToString(ISNEW_BASE));
    performDelete(sb.toString(),newList,null);

    // Restore new values
    sb = new StringBuilder("WHERE ");
    newList.clear();

    sb.append(buildConjunctionClause(newList,new ClauseDescription[]{
      new UnitaryClause(jobIDField,jobID),
      new MultiClause(parentIDHashField,list)})).append(" AND ");

    sb.append(newField).append(" IN (?,?)");
    newList.add(statusToString(ISNEW_EXISTING));
    newList.add(statusToString(ISNEW_NEW));

    HashMap map = new HashMap();
    map.put(newField,statusToString(ISNEW_BASE));
    map.put(processIDField,null);
    performUpdate(map,sb.toString(),newList,null);

    noteModifications(0,list.size(),0);

So the question becomes: does it get called appropriately?

Karl



On Sun, Mar 21, 2021 at 8:45 PM Karl Wright <daddy...@gmail.com> wrote:

> I've tried to refresh my memory by looking at the carrydown code, which is
> quite old at this point.  But one thing is very clear: that code never
> removes carrydown data values unless the child or parent document goes
> away, and wasn't intended to.
>
> It's not at all trivial to do but the code here could be modified to set
> the carrydown values to exactly what is specified in the reference for the
> given parent.  However, I worry that changing this behavior will break
> something.  Carrydown has a built-in assumption that if the reference is
> added multiple times with different data during a crawl, eventually the
> data will stabilize and no more downstream processing will be necessary.
> Carrydown changes that are incautious will result in jobs that never
> complete.
>
> I think it is worth looking at changing the behavior such that no
> accumulation of values takes place, though.  It's not an easy change I
> fear.  I'll look into how to make it happen.
>
> Karl
>
>
>
> On Sun, Mar 21, 2021 at 1:18 PM <julien.massi...@francelabs.com> wrote:
>
>> ---------------------------- First crawl ----------------------------
>>
>> In the processDocument method the following code is triggered on the
>> parentIdentifier:
>>
>> activities.addDocumentReference(childIdentifier, parentIdentifier, null,
>> new String[] { "content" }, new String[][] { { "someContent" } });
>>
>> Then the childIdentifier is processed and the following code is triggered
>> in the processDocument method:
>>
>> final String[] contentArray =
>> activities.retrieveParentData(childIdentifier, "content");
>>
>> At this point, the childIdentifier correctly retrieves a contentArray
>> containing 1 value, which is "someContent".
>>
>> ---------------------------- Second crawl ----------------------------
>>
>> In the processDocument method the following code is triggered on the
>> parentIdentifier:
>>
>> activities.addDocumentReference(childIdentifier, parentIdentifier, null,
>> new String[] { "content" }, new String[][] { { "newContent" } });
>>
>> Then the childIdentifier is processed and the following code is triggered
>> in the processDocument method:
>>
>> final String[] contentArray =
>> activities.retrieveParentData(childIdentifier, "content");
>>
>> At this point, the childIdentifier retrieves a contentArray containing 2
>> values: the old one, "someContent", and the new one, "newContent".
>>
>> I can guarantee that the parentIdentifier is the same in both crawls and
>> that on the second crawl only "newContent" is added.  I debugged the
>> code to confirm everything.
>>
>>
>>
>> Julien
>>
>>
>> -----Original Message-----
>> From: Karl Wright <daddy...@gmail.com>
>> Sent: Sunday, March 21, 2021 16:05
>> To: dev <dev@manifoldcf.apache.org>
>> Subject: Re: How to override carry down data
>>
>> Can you give me a code example?
>> The carry-down information is set by the parent, as you say.  The
>> specific information is keyed to the parent so when the child is added to
>> the queue, all old carrydown information from the same parent is deleted at
>> that time, and until that happens the carrydown information is preserved
>> for every child.  As you say, it can be augmented by other parents that
>> refer to the same child, but it is never *replaced* by carrydown info from
>> a different parent, just augmented.
>>
>> If it didn't work this way, MCF would have horrendous order dependencies
>> in what documents got processed first.  As it is, when the carrydown
>> information changes because another parent is discovered, the children are
>> queued for processing to achieve stable results.
>>
>> Karl
>>
>>
>> On Sun, Mar 21, 2021 at 10:45 AM <julien.massi...@francelabs.com> wrote:
>>
>> > Hi Karl,
>> >
>> >
>> >
>> > I am using carry-down data in a repository connector, but I have
>> > found that I am unable to update/override a value that has already
>> > been set.
>> > Indeed, even though I am using the same key and the same parent
>> > identifier, the values are stacked. So, when I retrieve carry-down
>> > data through the key, I get more and more values in the array instead
>> > of only one that is updated.
>> > It seems I misunderstood the documentation: I believed that the
>> > carry-down data values are stacked only if there are several parent
>> > identifiers for the same key.
>> > What can I do to maintain only one carry-down data value for a given
>> > key and a given parent identifier?
>> >
>> >
>> >
>> > Regards,
>> >
>> > Julien
>> >
>> >
>> >
>> >
>>
>>
