Thanks for doing this investigation.

I'll take a look at what else is in the PoU Parse State that might be worth
playing similar copy-on-write tricks on.

But I believe profiling is the next step on this.
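
For anyone following along, here is a rough sketch of the copy-on-write trick the quoted comment describes (illustrative only: the names mirror the discussion, but this is not Daffodil's actual PState/VariableMap API):

```scala
// Sketch of copy-on-write snapshots of variable state at points of
// uncertainty (PoUs). Names are hypothetical, not Daffodil's real classes.

// A variable map holding mutable bindings (string-valued for simplicity).
final class VariableMap(private var bindings: Map[String, String]) {
  def get(name: String): Option[String] = bindings.get(name)
  def set(name: String, value: String): Unit = bindings += (name -> value)
  def deepCopy(): VariableMap = new VariableMap(bindings)
}

final class PState(var variableMap: VariableMap) {
  // True while a PoU snapshot shares our VariableMap instance.
  private var sharedWithSnapshot = false

  // Snapshot at a point of uncertainty: intentionally a shallow copy (we
  // just hand out the shared reference), so taking a snapshot is cheap.
  def captureSnapshot(): VariableMap = {
    sharedWithSnapshot = true
    variableMap
  }

  // Copy-on-write: right before the first change after a snapshot, replace
  // our map with a deep copy so the snapshot is never mutated.
  def setVariable(name: String, value: String): Unit = {
    if (sharedWithSnapshot) {
      variableMap = variableMap.deepCopy()
      sharedWithSnapshot = false
    }
    variableMap.set(name, value)
  }
}
```

With this shape, a point of uncertainty that never touches a variable pays only for setting a flag; the deep copy happens at most once, on the first write after a snapshot.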

On Tue, Jan 9, 2024 at 5:34 PM Steve Lawrence <slawre...@apache.org> wrote:

> And here's where we do some logic and a more detailed comment about it:
>
>
> https://github.com/apache/daffodil/blob/main/daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/parsers/PState.scala#L346-L362
>
> So I think we do already do copy-on-write for variables when parsing.
>
>
> On 2024-01-09 05:28 PM, Steve Lawrence wrote:
> > There's actually a comment in the PState captureFrom() method used to
> > capture state during PoUs:
> >
> > // Note that this is intentionally a shallow copy. This normally would
> > // not work because the variable map is mutable so other state changes
> > // could mutate this snapshot. This is avoided by carefully changing the
> > // PState variable map to a deep copy of this variable map right before a
> > // change is made. This essentially makes the PState variable map behave
> > // as copy-on-write.
> > this.variableMap = ps.variableMap
> >
> > Assuming that is all true and done correctly, we might actually already
> > do what you suggest, at least for variables. But there might be other
> > parts of PState that would improve performance by changing to
> > copy-on-write. We may want to do some profiling on formats with lots of
> > PoUs to see if anything shows up.
> >
> >
> > On 2024-01-09 04:03 PM, Mike Beckerle wrote:
> >> Actually, I haven't measured it, but there are 4 built-in variables, so
> >> even if a schema introduces no new variables of its own, there is
> >> overhead to deal with copying the state of 4 variables just in case you
> >> need to backtrack them, and this overhead occurs for every point of
> >> uncertainty.
> >>
> >> Also, more and more schemas are using variables. We're finding them
> >> very useful.
> >>
> >> Nevertheless I think the vast bulk of points of uncertainty will come
> >> and go with no variables being touched. They tend to get used for
> >> specific things, but not all over the place.
> >>
> >> For example, several schemas have a feature to capture bad data into a
> >> hexBinary Blob element so as to be able to keep parsing a large file,
> >> instead of failing on the first bad data item.
> >> Whether they do this or just fail is controlled by a variable. But that
> >> variable is not touched unless legal parsing fails. So one would hope
> >> the vast bulk of the data processing would never touch that variable,
> >> yet every single record in the data file is a point of uncertainty.
> >>
> >> On Tue, Jan 9, 2024 at 1:49 PM Larry Barber <larry.bar...@nteligen.com>
> >> wrote:
> >>
> >>> Seems like the benefit would only be significant if you were dealing
> >>> with lots of variables.
> >>>
> >>> -----Original Message-----
> >>> From: Mike Beckerle <mbecke...@apache.org>
> >>> Sent: Tuesday, January 9, 2024 1:39 PM
> >>> To: dev@daffodil.apache.org
> >>> Subject: Thoughts on on-demand copying of parser state
> >>>
> >>> Right now we copy the state of the parser as every point of
> >>> uncertainty is reached.
> >>>
> >>> I am speculating that we could copy on demand. So, for example, if no
> >>> variable-modifying operation occurs, then there would be no overhead
> >>> to copy the variable state.
> >>>
> >>> This comes at the cost of each variable access doing an additional
> >>> test of whether the variable state needs to be copied first.
> >>>
> >>> Thoughts?
> >>>
> >>
> >
>
>
