Thanks for doing this investigation. I'll take a look at what else is in the PoU Parse State that might be worth playing similar copy-on-write tricks on.
But I believe profiling is the next step on this.

On Tue, Jan 9, 2024 at 5:34 PM Steve Lawrence <slawre...@apache.org> wrote:

> And here's where we do some logic and a more detailed comment about it:
>
> https://github.com/apache/daffodil/blob/main/daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/parsers/PState.scala#L346-L362
>
> So I think we already do copy-on-write for variables when parsing.
>
> On 2024-01-09 05:28 PM, Steve Lawrence wrote:
>> There's actually a comment in the PState captureFrom() method used to
>> capture state during PoUs:
>>
>>   // Note that this is intentionally a shallow copy. This normally would
>>   // not work because the variable map is mutable, so other state changes
>>   // could mutate this snapshot. This is avoided by carefully changing the
>>   // PState variable map to a deep copy of this variable map right before a
>>   // change is made. This essentially makes the PState variable map behave
>>   // as copy-on-write.
>>   this.variableMap = ps.variableMap
>>
>> Assuming that is all true and done correctly, we might actually already
>> do what you suggest, at least for variables. But there might be other
>> parts of PState that would improve performance by changing to
>> copy-on-write. We may want to do some profiling on formats with lots of
>> PoUs to see if anything shows up.
>>
>> On 2024-01-09 04:03 PM, Mike Beckerle wrote:
>>> Actually, I haven't measured it, but there are 4 built-in variables, so
>>> even if a schema introduces no new variables of its own, there is
>>> overhead to deal with copying the state of 4 variables just in case you
>>> need to backtrack them, and this overhead occurs for every point of
>>> uncertainty.
>>>
>>> Also, more and more schemas are using variables. We're finding them
>>> very, very useful.
>>>
>>> Nevertheless, I think the vast bulk of points of uncertainty will come
>>> and go with no variables being touched. They tend to get used for
>>> specific things, but not all over the place.
>>>
>>> For example, several schemas have a feature to capture bad data into a
>>> hexBinary blob element so as to be able to keep parsing a large file
>>> instead of failing on the first bad data item.
>>> Whether they do this or just fail is controlled by a variable. But that
>>> variable is not touched unless legal parsing fails. So one would hope
>>> the vast bulk of the data processing would never touch that variable,
>>> yet every single record in the data file is a point of uncertainty.
>>>
>>> On Tue, Jan 9, 2024 at 1:49 PM Larry Barber <larry.bar...@nteligen.com>
>>> wrote:
>>>
>>>> Seems like the benefit would only be significant if you were dealing
>>>> with lots of variables.
>>>>
>>>> -----Original Message-----
>>>> From: Mike Beckerle <mbecke...@apache.org>
>>>> Sent: Tuesday, January 9, 2024 1:39 PM
>>>> To: dev@daffodil.apache.org
>>>> Subject: Thoughts on on-demand copying of parser state
>>>>
>>>> Right now we copy the state of the parser as every point of
>>>> uncertainty is reached.
>>>>
>>>> I am speculating that we could copy on demand. So, for example, if no
>>>> variable-modifying operation occurs, then there would be no overhead
>>>> to copy the variable state.
>>>>
>>>> This comes at the cost of each variable doing an additional test of
>>>> whether the variable state needs to be copied first.
>>>>
>>>> Thoughts?
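[Editor's note: the copy-on-write scheme discussed in this thread can be sketched in a few lines. This is a hypothetical illustration only, not Daffodil's actual API: `VarMap`, `ParseState`, `capture()`, and `mapIsShared` are invented names standing in for PState, captureFrom(), and the variable map. It shows the tradeoff Mike describes: snapshots at points of uncertainty become cheap shallow copies, at the cost of each variable write performing one extra test.]

```scala
import scala.collection.mutable

// Hypothetical sketch of copy-on-write parse-state snapshots.
// All class and method names here are invented for illustration.
final class VarMap(private val entries: mutable.Map[String, String]) {
  def get(name: String): Option[String] = entries.get(name)
  def set(name: String, value: String): Unit = entries(name) = value
  def deepCopy: VarMap = new VarMap(entries.clone())
}

final class ParseState(var variableMap: VarMap) {
  // True when a snapshot shares our map; the next write must copy first.
  private var mapIsShared = false

  // Snapshot for a point of uncertainty: a shallow copy that shares the
  // variable map, so taking a snapshot costs nothing per variable.
  def capture(): ParseState = {
    mapIsShared = true
    val snap = new ParseState(variableMap)
    snap.mapIsShared = true // writes through the snapshot must also copy
    snap
  }

  // The "additional test" each variable write pays: deep-copy the map
  // only on the first write after a snapshot was taken.
  def setVariable(name: String, value: String): Unit = {
    if (mapIsShared) {
      variableMap = variableMap.deepCopy
      mapIsShared = false
    }
    variableMap.set(name, value)
  }
}
```

Under this scheme a point of uncertainty that never touches a variable pays no copying cost at all, matching the common case Mike describes (e.g. the bad-data blob variable that is only written when legal parsing fails):

```scala
val s = new ParseState(new VarMap(mutable.Map("blobMode" -> "fail")))
val snapshot = s.capture()        // cheap: shares the map
s.setVariable("blobMode", "blob") // first write after capture: deep copy
// The snapshot still sees the pre-write value and can be used to backtrack.
```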