Re: [basex-talk] Performance issue on copy-modify-return when too many items are being replaced

France Baril Fri, 12 Nov 2021 05:44:43 -0800

Forwarding to the mailing list in order to share knowledge.

On Fri, Nov 12, 2021 at 1:41 PM BaseX Support <supp...@basex.org> wrote:


> Hi France,
>
> I’d need to get my hands on your code to tell you exactly where it’s
> best used, but I can give you some more details on the XQuery
> specification:
>
> When creating new nodes in XQuery via node constructors [1], copies of
> all enclosed nodes will be created, and the copied nodes get new node
> identities. As a result, the following query yields false:
>
> let $a := <a/>
> let $b := <b>{ $a }</b>
> return $b/a is $a
>
> This step can be very expensive and memory consuming. If the COPYNODE
> option is disabled, child nodes will only be linked to the new parent
> nodes, and the query above returns true.
>
> As the option changes the semantics of XQuery, it should preferably be
> used in Pragmas.
>
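
A minimal sketch of the pragma form described above, assuming BaseX (the db prefix and the copynode pragma are BaseX-specific, see the Database Pragmas page linked further down in this thread). It wraps the same three-line query, so per the explanation above the identity check is expected to succeed because $a is linked rather than copied:

```xquery
(: BaseX-specific sketch: disables node copying only inside the braces. :)
(# db:copynode false #) {
  let $a := <a/>
  let $b := <b>{ $a }</b>
  return $b/a is $a
}
```

Outside the braces, node constructors keep their standard copying semantics, which is why a pragma is a narrower tool than setting the option globally.
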
> Best,
> Christian
>
> PS: Mails to our mailing list are preferred; this way, other users
> might benefit from the replies as well.
>
> [1] https://www.w3.org/TR/xquery-31/#id-constructors
>
>
>
> On Fri, Nov 12, 2021 at 2:13 PM France Baril
> <france.ba...@architextus.com> wrote:
> >
> > Can you give me more information about how COPYNODE changes the behavior
> > of the XQuery and where it is best used?
> >
> > I see in the example that the pragma is on db:open. My process is:
> >
> > 1. Read a document A from a DB called lang that has references to other
> > documents in the same DB lang (where lang is a 4-letter code for a locale).
> > 2. Merge all the references into document A to create an aggregate.
> > 3. Send the aggregate through multiple functions (that use
> > copy-modify-return) that each resolve a type of reference (most references
> > grab referenced content from a DB called global, but others grab it from
> > the lang DB). These references do not grab entire documents, but smaller
> > snippets within XML documents.
> > 4. Save the result in a DB called staging-lang (where lang is a 4-letter
> > code for a locale).
> >
> > So should the pragma apply when reading the 1st document (1), when
> > reading the documents we aggregate into the 1st document (2), when grabbing
> > the snippets (3) and/or when saving the end result in the staging DB (4)?
> > Or maybe for all db:open() and db:attribute()/.. functions in this process?
> >
> > On Fri, Nov 12, 2021 at 12:16 PM BaseX Support <supp...@basex.org> wrote:
> >>
> >> One more suggestion:
> >>
> >> If node construction turns out to consume too much memory, it sometimes
> >> helps to disable the COPYNODE option:
> >>
> >> https://docs.basex.org/wiki/XQuery_Extensions#Database_Pragmas
> >>
> >>
> >>
> >> France Baril <france.ba...@architextus.com> wrote on Fri, 12 Nov 2021, 13:09:
> >>>
> >>> Hi,
> >>>
> >>> Thanks for your answer.
> >>>
> >>> I tried rebuilding the document instead of using copies, I have
> >>> implemented 3/4 of the functions that resolve references and I'm
> >>> already at double the time I had before. So I will set that one aside
> >>> as an unsuccessful alternative. If memory serves me correctly we might
> >>> have moved from a transform that rebuilds the document to a
> >>> copy-modify-return approach to improve performance over a year ago.
> >>>
> >>> I will try grouping the references of the same names in the example
> >>> above to limit the number of queries to the DB. If that still doesn't
> >>> help, I will see if I can send you a good example without having to
> >>> send too many of our.
> >>>
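
The grouping idea in the paragraph above can be sketched as follows. This is an illustration under stated assumptions, not the production code: $index and local:lookup are hypothetical stand-ins for the db:attribute('index-prompt-' || $lang, $name, 'name')/.. call in the original function, and the sample document is invented. The point is only that each distinct name is looked up once and then reused for every reference that carries it:

```xquery
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Hypothetical in-memory stand-in for the prompt index database. :)
declare variable $index :=
  <index>
    <prompt name="ok">OK</prompt>
    <prompt name="cancel">Cancel</prompt>
  </index>;

(: Hypothetical lookup; the real code would query the index DB here. :)
declare function local:lookup($name as xs:string) as element()* {
  $index/*[@name = $name]
};

let $doc :=
  <doc>
    <prompt-ref name-ref="ok"/>
    <gui-ctrl-ref name-ref="ok"/>
    <prompt-ref name-ref="cancel"/>
    <prompt-ref name-ref="ok"/>
  </doc>
(: One lookup per distinct name instead of one per reference. :)
let $cache := map:merge(
  for $name in distinct-values($doc//@name-ref ! string())
  return map:entry($name, local:lookup($name))
)
for $entry in $doc//*[@name-ref]
return
  <resolved from="{ name($entry) }">{
    $cache(string($entry/@name-ref))
  }</resolved>
```

With four references and two distinct names, local:lookup runs twice instead of four times; whether that matters in practice depends on how much of the total time is spent in lookups rather than in applying the updates.
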
> >>> We have a short-term solution where we removed some references in
> >>> references, which substantially reduces the number of items to resolve
> >>> (80% improvement), but it does impact the user experience, so we are
> >>> still looking into code-based solutions as opposed to (or to use in
> >>> conjunction with) content-based solutions.
> >>>
> >>> On Fri, Nov 5, 2021 at 5:22 PM BaseX Support <supp...@basex.org> wrote:
> >>> >
> >>> > Hi France,
> >>> >
> >>> > Do you have some sample data that allows us to test your code?
> >>> >
> >>> > If documents are pretty large, it’s sometimes faster to rebuild a
> >>> > document with node constructors instead of performing updates on it.
> >>> >
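
For readers unfamiliar with the rebuild approach suggested above, here is a minimal sketch of it in plain XQuery 3.1. The function name local:rebuild, the $resolve callback, and the sample input are all hypothetical; the sketch only shows the general shape of a recursive identity transform that swaps out elements carrying @name-ref instead of queuing replace node updates:

```xquery
(: Recursively rebuilds a tree; elements with @name-ref are replaced by
   whatever the (hypothetical) $resolve callback returns for them. :)
declare function local:rebuild(
  $node    as node(),
  $resolve as function(element()) as node()*
) as node()* {
  typeswitch ($node)
    case document-node() return
      document { $node/node() ! local:rebuild(., $resolve) }
    case element() return
      if ($node/@name-ref)
      then $resolve($node)
      else element { node-name($node) } {
        $node/@*,
        $node/node() ! local:rebuild(., $resolve)
      }
    default return $node
};

let $input :=
  <topic>
    <p>Press <prompt-ref name-ref="ok"/> to continue.</p>
  </topic>
return local:rebuild($input, function($ref) {
  <filter-group-inline>{ string($ref/@name-ref) }</filter-group-inline>
})
```

Every element constructor in this approach copies its content, so it trades pending-update bookkeeping for node copying; which side wins depends on the document and the number of replacements.
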
> >>> > Best,
> >>> > Christian
> >>> > ____________________________________
> >>> >
> >>> > > We have a query that looks like this:
> >>> > >
> >>> > > declare function content-refs:resolve-prompt-refs-new(
> >>> > >   $node as node(),
> >>> > >   $lang as xs:string
> >>> > > ) as node()* {
> >>> > >   let $result :=
> >>> > >     copy $copy := $node
> >>> > >     modify (
> >>> > >       let $entries := $copy/descendant-or-self::*[@name-ref][
> >>> > >         name()='prompt-ref' or name()='gui-ctrl-ref'
> >>> > >         or name()='feature-ref' or name()='app-ref'
> >>> > >         (: or name()='screen-ref' :)]
> >>> > >
> >>> > >       let $entries-hd :=
> >>> > >         $copy/descendant-or-self::*[@id='T1700243243']
> >>> > >           /descendant-or-self::*[@name-ref][
> >>> > >             name()='prompt-ref' or name()='gui-ctrl-ref'
> >>> > >             or name()='feature-ref' or name()='app-ref'
> >>> > >             (: or name()='screen-ref' :)]
> >>> > >
> >>> > >       let $trace := trace('Prompts count: ' || count($entries))
> >>> > >       let $trace := trace('Prompts in Hardware diagram: ' ||
> >>> > >         count($entries-hd))
> >>> > >
> >>> > >       for $entry in $entries
> >>> > >       (:let $trace := trace('start processing entry'):)
> >>> > >       let $name := $entry/data(@name-ref)
> >>> > >       let $trace :=
> >>> > >         if (exists($entry/ancestor::*[@id = 'T1700243243']))
> >>> > >         then trace( $name , ' Promptref&#10;')
> >>> > >         else ()
> >>> > >       let $prompts-from-index :=
> >>> > >         db:attribute('index-prompt-' || $lang, $name, 'name')/..
> >>> > >         (:=> prof:time('index prompt attr: '):)
> >>> > >       (:let $prompts-from-index := db:open('index-prompt-' ||
> >>> > >         $lang)//*[@name = $name] => prof:time('index prompt open: '):)
> >>> > >       let $prompts :=
> >>> > >         for $prompt in $prompts-from-index
> >>> > >         let $original-elem-name := $entry/self::*/name()
> >>> > >         let $new-elem-name :=
> >>> > >           switch ($original-elem-name)
> >>> > >           case 'prompt-ref' return $original-elem-name
> >>> > >           default return substring-before($original-elem-name, '-ref')
> >>> > >         return
> >>> > >           copy $prompt-renamed := $prompt
> >>> > >           modify (
> >>> > >             rename node $prompt-renamed as $new-elem-name
> >>> > >           )
> >>> > >           return $prompt-renamed
> >>> > >           (:=> prof:time('index prompt new elem-name: '):)
> >>> > >       let $new-node :=
> >>> > >         if (count($prompts) = 0)
> >>> > >         then
> >>> > >           <filter-group error="{concat("No target found in for:&#32;",
> >>> > >             $entry/name(), '/@name-ref=', $entry/@name-ref)}"/>
> >>> > >         else <filter-group-inline>{
> >>> > >           $prompts
> >>> > >         }</filter-group-inline>
> >>> > >       let $trace := ('Ready to replace old entry with new-node')
> >>> > >       return replace node $entry with $new-node
> >>> > >       (:=> prof:time('index prompt new node: '):)
> >>> > >     )
> >>> > >     return $copy  (:=> prof:time('index prompt return copy: '):)
> >>> > >   return $result
> >>> > > };
> >>> > >
> >>> > > As you can see, we are using prof:time to see how quickly items are
> >>> > > resolved. Querying the DB for each item goes fairly quickly (2
> >>> > > seconds). However, that last 'return $copy' line, after all the
> >>> > > replacements are processed, takes between 11 and 25 minutes
> >>> > > depending on the system. Memory usage is low, but the CPU usage
> >>> > > goes through the roof.
> >>> > >
> >>> > > We are updating a little over 110 000 items in this operation, so
> >>> > > it is a big operation on a file of about 89000 indented lines. We
> >>> > > are wondering if there is a way we could improve the performance.
> >>> > > Before this operation occurs, we are processing the file multiple
> >>> > > times to replace other items with very similar functions
> >>> > > (copy-modify-return). They all go fairly quickly, so it does seem
> >>> > > that the culprit is the number of items being replaced.
> >>> > >
> >>> > >
> >>> > > --
> >>> > > France Baril
> >>> > > Architecte documentaire / Documentation architect
> >>> > > france.ba...@architextus.com
> >>>
> >>>
> >>>
> >>> --
> >>> France Baril
> >>> Architecte documentaire / Documentation architect
> >>> france.ba...@architextus.com
> >
> >
> >
> > --
> > France Baril
> > Architecte documentaire / Documentation architect
> > france.ba...@architextus.com
>


-- 
France Baril
Architecte documentaire / Documentation architect
france.ba...@architextus.com
