>> From the log I have it seems the fetcher tries to resolve URLs before 
>> the PluginRepository is initialized.
>
>The Fetcher is highly concurrent, it may (even has to) start feeding the fetch 
>queues before
> fetching can start. The PluginRepository is initialized when the first plugin 
> instance is required
> (one of the protocol plugins).

Even if the fetcher is highly concurrent, at some point it needs to initialize 
itself from the configuration files. Since the PluginRepository only registers 
the plugins and instantiates them on demand, what is the value of also loading 
the PluginRepository itself on demand only?

>We could instantiate the PluginRepository beforehand, e.g. in 
>NutchConfiguration.create().
> However, it's not guaranteed that the configuration is not changed 
> afterwards. Indeed, that's
> done sometimes, esp. in unit tests.

This confuses me. What use case would justify changing the configuration 
without restarting nutch (that is, the JVM)? I may be too new to nutch to 
completely understand.

>What's worse is that there are definitely two cases
> - in unit tests
> - in Nutch server
>where more than one Configuration is used, every configuration with its own 
>PluginRepository!
>That's in contradiction with the "one and only" JVM-wide 
>URLStreamHandlerFactory.
>When running the unit tests ("ant test") we already get the exception
>  Caused by: java.lang.Error: factory already defined
>        at java.net.URL.setURLStreamHandlerFactory(URL.java:1112)

More than one configuration while reusing the same JVM sounds strange to me.
Does the configuration really differ that much?

Since the PluginRepository is registered in a 1:1 relationship with the JVM, I 
guess this class should become a singleton.
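
Just to spell out the JVM-wide constraint we keep running into, a two-line 
demonstration (not Nutch code, only an illustration of why the factory may be 
set exactly once per JVM):

import java.net.URL;

public class FactoryOnce {
  public static void main(String[] args) {
    URL.setURLStreamHandlerFactory(protocol -> null); // first call succeeds
    URL.setURLStreamHandlerFactory(protocol -> null); // java.lang.Error: factory already defined
  }
}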

> I see two ways to go:
> 1. be pragmatic
>   - instantiate PluginRepository in NutchConfiguration.create()
>   - set this instance as URLStreamHandlerFactory in the static method
>     PluginRepository.get(config) to make sure that the method
>     URL.setURLStreamHandlerFactory(..) is called exactly once
>   The default usage (one MapReduce job running in its own JVM)
>   will work this way. Unit tests should be easily fixed.

This would be compatible with the singleton principle.
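
Just to check that I understand option 1 correctly, here is a rough sketch of 
the registration part. Everything apart from PluginRepository.get(conf) and the 
factory call is my guess at how it could look, not the actual Nutch code:

import java.net.URL;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

import org.apache.hadoop.conf.Configuration;

public class PluginRepository implements URLStreamHandlerFactory {

  private static PluginRepository instance;

  private final Configuration conf;

  private PluginRepository(Configuration conf) {
    this.conf = conf;
    // ... register the plugins found in the configuration ...
  }

  public static synchronized PluginRepository get(Configuration conf) {
    if (instance == null) {
      instance = new PluginRepository(conf);
      // may only happen once per JVM, otherwise java.net.URL throws
      // "factory already defined"
      URL.setURLStreamHandlerFactory(instance);
    }
    return instance;
  }

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    // ask the registered protocol plugins for a handler; returning null
    // falls back to the JVM's built-in handlers (http, file, ...)
    return null;
  }
}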

>2. think of protocol handlers as static and more low-level,
>   e.g., implement them all to org.apache.nutch.protocol.<protocol>.Handler
>   and implement only the minimally required methods (eg. getDefaultPort()).
>   Plugins are dynamic but URLStreamHandler-s are not - they cannot
>   be changed.

In the case of the SMB protocol, a protocol handler has already been 
implemented following the standard JVM conventions, which do not match your 
suggestion. Would all such protocols have to be re-implemented?
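
If I read option 2 correctly, every protocol would need a handler class the JVM 
can find by naming convention, e.g. via the java.protocol.handler.pkgs system 
property pointing at org.apache.nutch.protocol. For SMB that would mean 
rewriting the existing handler into something roughly like the sketch below; 
the port number and the openConnection() behaviour are only my assumptions:

package org.apache.nutch.protocol.smb;

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class Handler extends URLStreamHandler {

  @Override
  protected int getDefaultPort() {
    return 445; // assumed default SMB port
  }

  @Override
  protected URLConnection openConnection(URL url) throws IOException {
    // openConnection() is abstract, so it has to be implemented even if the
    // actual fetching is done by the protocol plugin rather than URLConnection
    throw new IOException("smb: URLs are fetched by the protocol plugin");
  }
}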

I think a better way could be to keep the singleton PluginRepository instance, 
but allow it to reload its plugin configuration whenever the configuration 
itself is reloaded.
Such an approach works as long as you do not need multiple PluginRepository 
instances in the same JVM in parallel.
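
To sketch what I mean (class and method names are invented only for 
illustration, this is not existing Nutch code):

import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

import org.apache.hadoop.conf.Configuration;

public class ReloadablePluginRepository implements URLStreamHandlerFactory {

  private volatile Configuration conf;

  // rebuilt whenever the configuration changes, while the instance itself
  // stays registered as the JVM-wide URLStreamHandlerFactory
  public synchronized void reload(Configuration newConf) {
    this.conf = newConf;
    // drop cached plugin and handler instances here and re-read the plugin
    // definitions from newConf
  }

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    return null; // as above: look up the handler in the current configuration
  }
}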

To work around this limitation we could create a URLStreamHandlerFactory 
singleton that registers with the JVM, knows all instances of PluginRepository 
and, with some magic (I still do not understand the use case), picks the right 
one to determine which handler to use for the protocol of each URL.
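
Roughly like this (again only a sketch; it assumes PluginRepository exposes 
createURLStreamHandler() as in the sketch further up, and it leaves exactly 
that "pick the right one" magic open):

import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class DelegatingUrlStreamHandlerFactory implements URLStreamHandlerFactory {

  private final List<PluginRepository> repositories = new CopyOnWriteArrayList<>();

  // every PluginRepository registers itself here instead of calling
  // URL.setURLStreamHandlerFactory() itself
  public void register(PluginRepository repository) {
    repositories.add(repository);
  }

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    for (PluginRepository repository : repositories) {
      URLStreamHandler handler = repository.createURLStreamHandler(protocol);
      if (handler != null) {
        return handler; // first repository that knows the protocol wins
      }
    }
    return null; // fall back to the JVM's built-in handlers
  }
}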

Hiran
