Github user bwmcbride commented on the pull request:
https://github.com/apache/jena/pull/123#issuecomment-174351980
Thanks for all the comments. I've updated the pull request implementing
all the proposed changes and made a few other minor amendments.
However, looking at @afs comments again, I realise I am heading off down
the wrong path. What @afs had suggested was not memoizing the text index
objects in TextDataset factory, which is what this PR currently does, but
memoizing dataset objects.
My bad. I thought the suggestion was to memoize the TextDataSetLucene
objects in TextDatasetFactory. That does I think solve JENA-1122 (as I
understand it) but, as Andy says, it opens up other nasty problems.
So I propose to switch to memoizing datasets in the assembler.
I'm sorry the rest of this is so long. The short version is, what is the
best way to go about managing state for memoizing datasets in assemblers?
The simple option is to just move the code in this pull request for
memoizing TextIndexLucene datasets from TextDatasetFactory to
TextIndexLuceneAssembler. That addresses my immediate problem, but a solution
that worked for datasets in general would be better.
A problem is if/when/how to clear out the memoized state.
In use cases where an application or a service starts up, assembles its
components and then does what it does until it terminates, there may not be
much of an issue leaving state around in the assembler maps used to relate
nodes to reusable objects.
In use cases where an application is repeatedly building and tearing down
assemblies during its lifetime then not clearing out the memoizing state can
lead to failures when a component is reused between 'builds'. The current
TextIndexLucene test cases do this and they fail if the memoizing map is not
cleared out between tests. Maybe the tests are broken.
The current code in this pull request clears a TextIndexLucene object out
of the memoizingmap when the TextIndexLucene object is closed. But that is
basically a kludge. I don't think it a good idea to go round changing all
existing dataset implementations to support event handling callbacks on close.
The way I would naturally expect things to work is for assemblers to have
some notion of a build. A build defines a context in which state like the
memoizing map can be built up. The build context can be thrown away at the end
of the build and a new one created for the next build. This would prevent
reuse across builds. So if you do:
foo = assembler.build(R);
bar = assembler.build(R);
you will get two different assemblies, typically with no sharing between
them. (You would still get a lock failure if R had a TextIndexLucene component
that was not closed between creating the two assemblies.)
As far as I can see assemblers are not designed to work like this. Nor can
I see how to add this notion of a build context without affecting many existing
assemblers. I may be missing something.
I have raised the question in the hope that someone will suggest an
approach to a more general solution.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---