I agree with both of these concepts: only GC'ing things which are not in the index and also not reachable from something that is in the index, and making GC'ing (mostly) automatic, based on thresholds, etc., when a component exits back to the framework. This would be fine for now - if use cases come up where more programmatic control is needed, we could add it later.
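To make the first concept concrete, here is a minimal sketch in plain Java. It is not CAS code: the class name, the integer ids, the reference map and the "first new id" watermark are all made up for illustration, and it assumes FS ids are handed out in increasing order. It only shows the rule under discussion, where an FS is garbage if it was created during the current process call and is not reachable, directly or indirectly, from anything in the indexes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative model only; none of these names exist in the framework.
class FsGcSketch {

    /**
     * refs:       FS id -> ids of the FSs it references (every FS has an entry)
     * indexed:    ids of FSs currently in some index
     * firstNewId: first id handed out after the component's process call began
     */
    static Set<Integer> findGarbage(Map<Integer, List<Integer>> refs,
                                    Set<Integer> indexed,
                                    int firstNewId) {
        // Mark phase: collect everything reachable from the indexes.
        Set<Integer> reachable = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>(indexed);
        while (!work.isEmpty()) {
            int id = work.pop();
            if (reachable.add(id)) {
                for (int target : refs.getOrDefault(id, List.of())) {
                    work.push(target);
                }
            }
        }
        // Only FSs created during this process call are candidates, so
        // anything that existed before the component ran is left alone.
        Set<Integer> garbage = new TreeSet<>();
        for (int id : refs.keySet()) {
            if (id >= firstNewId && !reachable.contains(id)) {
                garbage.add(id);
            }
        }
        return garbage;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> refs = new HashMap<>();
        refs.put(1, List.of());   // pre-existing, indexed
        refs.put(2, List.of());   // pre-existing, not indexed: still kept
        refs.put(3, List.of(4));  // new, indexed, references 4
        refs.put(4, List.of());   // new, reachable through 3
        refs.put(5, List.of());   // new, referenced only by 6
        refs.put(6, List.of(5));  // new, unreachable from any index
        System.out.println(findGarbage(refs, Set.of(1, 3), 3)); // prints [5, 6]
    }
}
```

Note that FS 2 is never a candidate even though nothing references it, because it predates the process call. That is the conservative side of the tradeoff: ids that a caller might already hold stay valid, at the cost of leaving some older garbage in place.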
Maybe the next thing to focus on is the "contract" re: GC running. For a component (primitive or aggregate), the proposed contract is to have the GC not change the FS "id"s that existed prior to the component running. This is a tradeoff - more stability for existing uses of handles, versus less "aggressive" GCs.

-Marshall

Thilo Goetz wrote:
> Adam Lally wrote:
>
>> On Wed, Mar 11, 2009 at 8:53 AM, Marshall Schor <m...@schor.com> wrote:
>>
>>> I agree in general about not making things more complicated at least to
>>> the user. I can imagine education working for
>>> 1) things like string interning
>>> 2) things like deleting features from type systems where they're not
>>> being used, and where the annotator producing them will respect this.
>>>
>>> What this approach seems to miss are the following kinds of things:
>>>
>>> 1) cases where some set of annotators produce feature structures, which,
>>> after some point, are no longer needed, and are "deleted" but
>>> never-the-less continue to consume space.
>>>
>>> 2) cases where some set of annotators produce feature structures having
>>> lots of fields, where, after some point, the fields are no longer needed.
>>>
>>> If these are not significant use-cases in practice, then I'm happy to
>>> think-about / work-on other things :-).
>>>
>>>
>> I'd like to propose discussing the different ideas here one at a time.
>> We had enough trouble coming to any agreement on GC the last time
>> that we discussed it, without also throwing string interning and
>> feature deleting into the mix.
>>
>> So focusing on GC first (unless you think one of the others is more
>> important):
>>
>> My inclination is to assure that GC deletes only garbage, and that
>> there's no possibility that anything GC'ed could have been referenced
>> by anybody. The other proposals that don't have this guarantee are
>> scary to me.
>>
>> A way to accomplish this guarantee would be that when the process
>> method of an AnalysisEngine (could be either primitive or aggregate)
>> completes, we can mark as garbage any FS's that were created since the
>> beginning of that process method, but which are not referenced
>> directly or indirectly from anything in the indexes. Does this
>> concept seem reasonable?
>>
>
> +1. I like the idea because it is sort of local on the one
> hand, but still allows one to delete FSs from indexes
> later in the processing and have them garbage collected
> (on exiting the containing aggregate).
>
>
>> The next question is under what conditions would a GC execute.
>> Requiring an explicit call seems counter to what other garbage
>> collecting runtime environments do, and like Thilo I'm confused about
>> who would call this and when. I think it would be better to define
>> the parameters that control GC in the PerformanceTuningSettings that
>> we already have, and make them dependent on how much CAS heap space is
>> used relative to a GC threshold that the user has set in the
>> PerformanceTuningSettings.
>>
>
> +1, and the default could be "no GC", so it would be
> perfectly backwards compatible. I'm thinking of the
> kinds of scenarios that I often work with, where
> basically all the annotations are later written to
> an index, and any attempt at GC would be futile and
> just consume time to no benefit.
>
>
>> -Adam
>>
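The threshold idea quoted above could look roughly like the following. This is a sketch under assumptions, not an existing setting: the key name "cas_gc_threshold" is hypothetical, the settings are modelled as a plain java.util.Properties object, and heap usage is passed in as two numbers rather than read from a CAS. Leaving the key unset means "no GC", which matches the backwards-compatible default suggested in the thread.

```java
import java.util.OptionalDouble;
import java.util.Properties;

// Illustrative only; the key below is not an existing tuning setting.
class GcTriggerSketch {

    // Hypothetical key: fraction of CAS heap in use (0.0 to 1.0) at which GC runs.
    static final String GC_THRESHOLD_KEY = "cas_gc_threshold";

    // An absent key means GC is disabled, so existing configurations behave as before.
    static OptionalDouble threshold(Properties tuning) {
        String raw = tuning.getProperty(GC_THRESHOLD_KEY);
        return raw == null
                ? OptionalDouble.empty()
                : OptionalDouble.of(Double.parseDouble(raw));
    }

    // Intended to be checked when a component's process method returns to the framework.
    static boolean shouldCollect(Properties tuning, long usedCells, long capacityCells) {
        OptionalDouble t = threshold(tuning);
        return t.isPresent()
                && capacityCells > 0
                && (double) usedCells / capacityCells >= t.getAsDouble();
    }

    public static void main(String[] args) {
        Properties tuning = new Properties();
        System.out.println(shouldCollect(tuning, 900, 1000)); // false: no threshold set

        tuning.setProperty(GC_THRESHOLD_KEY, "0.8");
        System.out.println(shouldCollect(tuning, 900, 1000)); // true: 90% used
        System.out.println(shouldCollect(tuning, 100, 1000)); // false: 10% used
    }
}
```

With a check like this, pipelines that index nearly everything pay nothing unless someone opts in by setting a threshold.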