I've tried to refresh my memory by looking at the carrydown code, which is
quite old at this point.  But one thing is very clear: that code never
removes carrydown data values unless the child or parent document goes
away, and wasn't intended to.

It's not at all trivial to do but the code here could be modified to set
the carrydown values to exactly what is specified in the reference for the
given parent.  However, I worry that changing this behavior will break
something.  Carrydown has a built-in assumption that if the reference is
added multiple times with different data during a crawl, eventually the
data will stabilize and no more downstream processing will be necessary.
Incautious changes to carrydown will result in jobs that never
complete.
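
To make the difference concrete, here is a rough sketch of the two behaviors. It is a toy in-memory model, not the actual carrydown code: the class CarrydownModel and its methods addAccumulating, setReplacing and retrieve are hypothetical names made up for illustration, and the real implementation is database-backed and considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of carrydown storage; NOT ManifoldCF code.
class CarrydownModel {
  // (child, dataName) -> parent -> values supplied by that parent.
  private final Map<List<String>, Map<String, Set<String>>> store = new LinkedHashMap<>();

  private Map<String, Set<String>> byParent(String child, String name) {
    return store.computeIfAbsent(Arrays.asList(child, name), k -> new LinkedHashMap<>());
  }

  // Accumulating behavior: new values are added, old ones are never dropped.
  // Returns true if the stored data changed (the child would need reprocessing).
  boolean addAccumulating(String parent, String child, String name, String... values) {
    Set<String> stored = byParent(child, name).computeIfAbsent(parent, k -> new LinkedHashSet<>());
    return stored.addAll(Arrays.asList(values));
  }

  // Replacing behavior: this parent's values become exactly what was passed in.
  // Returns true if the stored data changed.
  boolean setReplacing(String parent, String child, String name, String... values) {
    Set<String> fresh = new LinkedHashSet<>(Arrays.asList(values));
    Set<String> old = byParent(child, name).put(parent, fresh);
    return !fresh.equals(old);
  }

  // What the child sees: the union of the values from all of its parents.
  List<String> retrieve(String child, String name) {
    Set<String> union = new LinkedHashSet<>();
    for (Set<String> values : byParent(child, name).values()) {
      union.addAll(values);
    }
    return new ArrayList<>(union);
  }
}
```

In this model, calling addAccumulating with "someContent" and then "newContent" for the same parent leaves the child with both values, while the same two calls through setReplacing leave only "newContent". Note also that under setReplacing a parent that alternates between two values reports a change every time, whereas the accumulating set can only grow and so eventually stops changing; that is one way to read the stabilization concern above.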

I think it is worth looking at changing the behavior such that no
accumulation of values takes place, though.  It's not an easy change, I
fear.  I'll look into how to make it happen.

Karl



On Sun, Mar 21, 2021 at 1:18 PM <julien.massi...@francelabs.com> wrote:

> ---------------------------- First crawl
> -----------------------------------------
>
> In the processDocument method the following code is triggered on the
> parentIdentifier:
>
> activities.addDocumentReference(childIdentifier, parentIdentifier, null,
> new String[] { "content" }, new String[][] { { "someContent" } });
>
> Then the childIdentifier is processed and the following code is triggered
> in the processDocument method:
>
> final String[] contentArray =
> activities.retrieveParentData(childIdentifier, "content");
>
> At this point, the childIdentifier correctly retrieves a contentArray
> containing 1 value, which is "someContent".
>
> ---------------------------- Second crawl
> -----------------------------------------
>
> In the processDocument method the following code is triggered on the
> parentIdentifier:
>
> activities.addDocumentReference(childIdentifier, parentIdentifier, null,
> new String[] { "content" }, new String[][] { { "newContent" } });
>
> Then the childIdentifier is processed and the following code is triggered
> in the processDocument method:
>
> final String[] contentArray =
> activities.retrieveParentData(childIdentifier, "content");
>
> At this point, the childIdentifier retrieves a contentArray containing 2
> values: the old one, "someContent", and the new one, "newContent".
>
> I can guarantee that the parentIdentifier is the same between the two
> crawls and that, on the second crawl, only "newContent" is added. I
> debugged the code to confirm everything.
>
>
>
> Julien
>
>
> -----Original Message-----
> From: Karl Wright <daddy...@gmail.com>
> Sent: Sunday, March 21, 2021 16:05
> To: dev <dev@manifoldcf.apache.org>
> Subject: Re: How to override carry down data
>
> Can you give me a code example?
> The carry-down information is set by the parent, as you say.  The specific
> information is keyed to the parent so when the child is added to the queue,
> all old carrydown information from the same parent is deleted at that time,
> and until that happens the carrydown information is preserved for every
> child.  As you say, it can be augmented by other parents that refer to the
> same child, but it is never *replaced* by carrydown info from a different
> parent, just augmented.
>
> If it didn't work this way, MCF would have horrendous order dependencies
> in what documents got processed first.  As it is, when the carrydown
> information changes because another parent is discovered, the children are
> queued for processing to achieve stable results.
>
> Karl
>
>
> On Sun, Mar 21, 2021 at 10:45 AM <julien.massi...@francelabs.com> wrote:
>
> > Hi Karl,
> >
> >
> >
> > I am using carry-down data in a repository connector, but I have
> > figured out that I am unable to update/override a value that has
> > already been set.
> > Indeed, even though I am using the same key and the same parent
> > identifier, the values are stacked. So, when I retrieve carry-down
> > data through the key, I get more and more values in the array instead
> > of a single updated one.
> > It seems I misunderstood the documentation: I believed that the
> > carry-down data values are stacked only if there are several parent
> > identifiers for the same key.
> > What can I do to maintain only one carry-down data value for a given
> > key and a given parent identifier?
> >
> >
> >
> > Regards,
> >
> > Julien
> >
> >
> >
> >
>
>
