This is great feedback, and we can use it to reorganize and rethink what the Nutch PluginCentral documentation should be.
I'll loop back to your original questions regarding plugin lifecycle shortly.

lewismc

On 2024/10/07 19:37:45 Hiran Chaudhuri wrote:
> Hello Lewis.
>
> I had a look and it seems the pages in PluginCentral do cover
> questions that were raised at some point in time, but they do not round
> off to a full plugin documentation.
>
> There are two lists of pages: information about the plugin system/plugin
> development, and a list of plugins you can download. While the latter
> contains example plugins, I'd expect the groundwork to be covered in the
> first list. So let's look at that:
>
> There is an introduction explaining why there is a plugin system at all,
> and general information about plugins. These two pages could actually
> make one high-level introduction.
>
> The technical concepts page is where I'd expect more details, like the
> classloading principle and the life cycle of plugins, and the means of
> communicating with the outside world (access configuration data, store
> temporary data, what to do with oversized content, fetch dates and 'not
> modified since' responses). Explain the difference between a resource
> that was 'not found' vs one that 'is gone'. Or whether the response to
> getRobotRules() should be cached and if so for how long. This may end up
> with a common part and then specialized pages for each of the plugin types.
>
> The remaining pages describe special problems or tutorials applicable to
> special cases: WritingPluginExample covers writing an IndexingFilter
> and a ScoringFilter. Not useful for someone looking into Protocol plugins.
> The next describes writing an indexing filter (again?). PluginGotchas
> describes a problem during compilation. Well, I never had one.
> And Tika? I'm trying my luck on Protocol plugins.
>
> Thus I can say that PluginCentral does not cover the questions I have
> raised so far.
>
> Hiran
>
>
> On 07.10.24 19:47, Lewis John McGibbney wrote:
> > Hi Hiran,
> >
> > If you haven't already, please take a look at
> > https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral and see if
> > any of your questions are answered. If we need to augment the documentation
> > then we can do that. Please let us know if this is the case.
> >
> > lewismc
> >
> > On 2024/10/06 20:13:35 Hiran Chaudhuri wrote:
> >> I was experimenting with the protocol plugin that continually connects
> >> and disconnects from the server for each and every request.
> >> HTTP may be lightweight (or cached in the httpclient code), but other
> >> protocols are not.
> >>
> >> My code was ruthless about establishing and tearing down the
> >> connections, but it looked very repetitive for getProtocolOutput and
> >> getRobotRules.
> >> Trying to make functions reusable first of all led to loss of complete
> >> control over the connection. No worries, they get garbage collected -
> >> don't they?
> >>
> >> Well, it seems these connections get closed and gc'ed, but it takes too
> >> much time. In between, the fetcher hits problems and runs into grace
> >> periods of 300 000 milliseconds. The total scan becomes slow
> >> just because I tried to optimize the code. Which leads me to the next
> >> question:
> >>
> >> What is the plugin's life cycle? Is there one plugin instance per
> >> server? One per URL? One per thread? Or one in total?
> >> This scope defines whether I can make use of local variables, or
> >> instance fields. Or is there some other mechanism where a plugin could
> >> store data that should survive across the getProtocolOutput calls? Could
> >> a plugin define which scope it wants to be in?
> >>
> >>
>
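
To make the lifecycle question in the quoted message concrete, here is a minimal sketch of one way a protocol plugin might keep connections alive across calls without knowing how many plugin instances the framework creates. It is not Nutch code and makes no claim about how Nutch scopes plugin instances: `FakeConnection`, `HostConnectionCache`, and `HostConnectionCacheDemo` are hypothetical names invented for illustration. The sketch assumes a connection can safely be reused for several requests to the same host, and that the plugin has some point at which it can close the cache explicitly instead of waiting for garbage collection.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical stand-in for an expensive protocol connection (not a Nutch type). */
final class FakeConnection implements AutoCloseable {
    static final AtomicInteger OPENED = new AtomicInteger();
    static final AtomicInteger CLOSED = new AtomicInteger();
    final String host;

    FakeConnection(String host) {
        this.host = host;
        OPENED.incrementAndGet(); // count every connection that is established
    }

    /** Pretend to fetch a resource over the already open connection. */
    String fetch(String path) {
        return host + path;
    }

    @Override
    public void close() {
        CLOSED.incrementAndGet(); // count every explicit teardown
    }
}

/** Hypothetical per-host cache: one open connection per host, closed explicitly. */
final class HostConnectionCache<C extends AutoCloseable> implements AutoCloseable {
    private final Map<String, C> byHost = new ConcurrentHashMap<>();
    private final Function<String, C> opener;

    HostConnectionCache(Function<String, C> opener) {
        this.opener = opener;
    }

    /** Returns the cached connection for the host, opening it on first use. */
    C get(String host) {
        return byHost.computeIfAbsent(host, opener);
    }

    int size() {
        return byHost.size();
    }

    /** Closes and forgets every cached connection; does not rely on the GC. */
    @Override
    public void close() throws Exception {
        for (String host : byHost.keySet()) {
            C connection = byHost.remove(host);
            if (connection != null) {
                connection.close();
            }
        }
    }
}

final class HostConnectionCacheDemo {
    public static void main(String[] args) throws Exception {
        try (HostConnectionCache<FakeConnection> cache =
                 new HostConnectionCache<>(FakeConnection::new)) {
            // Two requests to the same host share one connection.
            System.out.println(cache.get("example.org").fetch("/robots.txt"));
            System.out.println(cache.get("example.org").fetch("/index.html"));
            System.out.println("connections opened: " + FakeConnection.OPENED.get());
        }
        System.out.println("connections closed: " + FakeConnection.CLOSED.get());
    }
}
```

Whether such a cache belongs in an instance field or a static field depends on exactly the scope question asked above, and a real connection type may not be safe to share between fetcher threads, so treat this only as an illustration of the trade-off.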

