On 19.10.24 13:46, Sebastian Nagel wrote:
> Hi Hiran,
>
> ... to answer the questions ...
>
>> What is the plugin's life cycle? Is there one plugin instance per
>> server? One per URL? One per thread? Or one in total?
>
> There is a single instance per plugin and task (a Java process running
> Nutch). In local mode (not running in a distributed manner on a Hadoop
> cluster), this means there is a single instance of every plugin.
> So the Fetcher job holds one single instance of every protocol plugin.

OK, so the protocol plugins are effectively treated as singletons. But
that implies that all URLs of a given protocol are handled by this one
instance, even if they point to different hosts and use different user
accounts etc. (this does matter for the SMB plugin).
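
Just to make that concrete: if (and this is purely my assumption for
the sketch, nothing Nutch prescribes) host and credentials were carried
in the smb:// URL itself, the single plugin instance would have to pull
them out per URL, roughly like this:

  import java.net.URI;
  import java.net.URISyntaxException;

  public class SmbUrlInfo {

    // Pulls host and user out of an smb:// URL of the (assumed!) form
    // smb://user:pass@host/share/path. java.net.URI handles any
    // hierarchical scheme, so no SMB library is needed for this part.
    public static String[] hostAndUser(String url) throws URISyntaxException {
      URI uri = new URI(url);
      String userInfo = uri.getUserInfo();               // "user:pass" or null
      String user = (userInfo == null) ? "" : userInfo.split(":", 2)[0];
      return new String[] { uri.getHost(), user };
    }

    public static void main(String[] args) throws URISyntaxException {
      String[] hu = hostAndUser("smb://alice:secret@fileserver/share/doc.txt");
      System.out.println(hu[0] + " / " + hu[1]);         // fileserver / alice
    }
  }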



>> This scope defines whether I can make use of local variables, or
>> instance fields.
>
> Yes, you can use instance fields, e.g. to pool connections.

Good to know. But because of the above, the plugin itself needs to
manage the pool: different credentials on different hosts have to be
kept in separate connections. That logic can be implemented, but I see
one pitfall:

When can the connections be released? The plugin does not know whether
yet another URL will be fetched over a given connection, so all
connections need to be kept open as long as possible. There is no
close() method that would tell the plugin to release its resources. But
it seems such connections are only used by the Fetcher, and once that
job is finished the JVM terminates anyway.
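
So my current plan is roughly the following (a minimal sketch in plain
Java; SmbConnection and the host+user key format are made-up
placeholders, not anything from the Nutch API or a particular SMB
library):

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.function.Supplier;

  // Placeholder for whatever SMB client library ends up being used;
  // not a real API, only here to make the sketch self-contained.
  interface SmbConnection extends AutoCloseable {
    @Override
    void close();
  }

  public class SmbConnectionPool {

    // One pool per plugin instance, i.e. per Fetcher task. Keyed by
    // host + user so different credentials never share a connection.
    private final Map<String, SmbConnection> pool = new ConcurrentHashMap<>();

    public SmbConnectionPool() {
      // There is no close() callback from Nutch, so release everything
      // when the JVM (the Fetcher task) shuts down.
      Runtime.getRuntime().addShutdownHook(
          new Thread(() -> pool.values().forEach(SmbConnection::close)));
    }

    public SmbConnection get(String host, String user,
                             Supplier<SmbConnection> opener) {
      // computeIfAbsent is atomic, so concurrent Fetcher threads asking
      // for the same host/user end up sharing a single connection.
      return pool.computeIfAbsent(host + "|" + user, key -> opener.get());
    }
  }

The shutdown hook is only a stop-gap for the missing close() callback;
it simply relies on the fact that the JVM goes away when the Fetcher
task is done.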



>> Or is there some other mechanism where a plugin could
>> store data that should survive across the getProtocolOutput calls?
>
> No.
>
>> Could a plugin define which scope it wants to be in?
>
> No.

> Keep in mind that most methods, for example getProtocolOutput, need
> to be thread-safe. That is, they may be called concurrently from
> multiple Fetcher threads.

Yes, this is exactly the information I was after. Thank you for
providing it. :-)
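
For my own notes, this is the pattern I will stick to in the SMB plugin
(generic Java only; the method here merely mirrors the name
getProtocolOutput, its parameter and return types are simplified
placeholders):

  import java.text.SimpleDateFormat;
  import java.util.Date;

  public class ThreadSafetyExample {

    // WRONG: SimpleDateFormat is not thread-safe, so one shared instance
    // field would be corrupted by concurrent Fetcher threads:
    // private final SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");

    // OK: give each thread its own copy (or use a local variable per call).
    private final ThreadLocal<SimpleDateFormat> fmt =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    // Stand-in for the real getProtocolOutput(); several Fetcher threads
    // may call it at the same time, so all per-request state stays local.
    public String getProtocolOutput(String url) {
      long start = System.currentTimeMillis();
      return url + " fetched at " + fmt.get().format(new Date(start));
    }
  }

In other words: the only shared mutable state will be the connection
pool sketched above, everything else stays in local or thread-local
variables.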


Hiran

