Hi Hiran, hi Yossi,

excellent idea to register the PluginRepository as URLStreamHandlerFactory!
It's the only class which knows the protocols supported by active plugins.

> Your code calls setURLStreamHandlerFactory, the documentation for which says
> "This method can be called at most once in a given Java Virtual Machine".
> Isn't this going to be a problem?

It's not called anywhere else in Nutch. Of course, we must make sure that
setURLStreamHandlerFactory is not called by any library which is loaded
before the PluginRepository is instantiated. But this should not happen.
Needs to be tested, of course. We should possibly also catch the Error which
is thrown if our call is not the first one.
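
A minimal sketch of such a guard, not taken from the patch: the class name
FactoryRegistration and the method tryRegister are made up for illustration.
According to the Javadoc, URL.setURLStreamHandlerFactory throws an Error (not
an Exception) if a factory has already been set, so that is what gets caught.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

/** Hypothetical helper, for illustration only; not part of Nutch. */
public class FactoryRegistration {

  /**
   * Tries to install the given factory for this JVM.
   *
   * @return true if the factory was installed, false if
   *         URL.setURLStreamHandlerFactory rejected the call
   */
  public static boolean tryRegister(URLStreamHandlerFactory factory) {
    try {
      URL.setURLStreamHandlerFactory(factory);
      return true;
    } catch (Error e) {
      // Thrown if a factory has already been set in this JVM, e.g. by a
      // library loaded earlier. Note that catching Error is very broad.
      System.err.println("Could not register URLStreamHandlerFactory: "
          + e.getMessage());
      return false;
    }
  }
}
```

If tryRegister returns false, plugin protocols would not be resolvable via
java.net.URL, so the caller should at least log a clear warning.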

> Additionally, does this URLStreamHandlerFactory successfully load the
> standard handlers (HTTP, HTTPS...)? I would expect it to fail on these.

According to
https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#URL-java.lang.String-java.lang.String-int-java.lang.String-
it should fall through and use the default handlers, right?
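
A small sketch to check this, under my own assumptions: FooOnlyFactory and
FallThroughDemo are made-up names, and foo:// is just the placeholder protocol
from the tutorial idea. The factory returns null for every protocol it does
not know, which as far as I understand the docs lets the JVM continue with its
built-in handler lookup.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

/** Hypothetical demo class, for illustration only. */
public class FallThroughDemo {

  /** Knows only the placeholder protocol "foo"; returns null otherwise. */
  static class FooOnlyFactory implements URLStreamHandlerFactory {
    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
      if ("foo".equals(protocol)) {
        return new URLStreamHandler() {
          @Override
          protected URLConnection openConnection(URL u) throws IOException {
            // No real connection: only URL parsing matters in this sketch.
            throw new IOException("foo:// is only a placeholder");
          }
        };
      }
      return null; // unknown protocol: leave it to the JVM's default handlers
    }
  }

  /** Installs the factory and parses one custom and one standard URL. */
  public static String[] demo() {
    URL.setURLStreamHandlerFactory(new FooOnlyFactory());
    try {
      URL custom = new URL("foo://bar/baz");
      URL standard = new URL("http://example.org/index.html");
      return new String[] { custom.getProtocol(), standard.getProtocol() };
    } catch (MalformedURLException e) {
      throw new IllegalStateException("URL was not accepted", e);
    }
  }
}
```

If both URLs are constructed without a MalformedURLException, the standard
handlers still work alongside the custom factory.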

> To be able to create a pull request, your repository needs to be a fork of
> the original repository, which does not seem to be the case here.

Please also open an issue on the Nutch Jira - it's required to properly track
the changes and for the release notes. A pull request on GitHub which contains
the Jira ID (NUTCH-XXXX) is then automatically tracked on Jira. Also, please
use the Eclipse code formatting template
(https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml). Thanks!

Best,
Sebastian

On 09/22/2017 11:10 AM, Yossi Tamari wrote:
> Hi Hiran,
> 
> Your code calls setURLStreamHandlerFactory, the documentation for which says 
> "This method can be called at most once in a given Java Virtual Machine". 
> Isn't this going to be a problem? 
> https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#setURLStreamHandlerFactory-java.net.URLStreamHandlerFactory-
> Additionally, does this URLStreamHandlerFactory successfully load the 
> standard handlers (HTTP, HTTPS...)? I would expect it to fail on these.
> 
> To be able to create a pull request, your repository needs to be a fork of 
> the original repository, which does not seem to be the case here.
> 
>       Yossi.
> 
> -----Original Message-----
> From: Hiran CHAUDHURI [mailto:hiran.chaudh...@amadeus.com] 
> Sent: 22 September 2017 11:54
> To: user@nutch.apache.org
> Subject: RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?
> 
> Hello all.
> 
> This time following up on my own post...
> 
>>>> When you look at the protocol-smb hook it comes with this static 
>>>> hook, but as it is never executed does not help.
>>>
>>> Yes, it has to be called.
>>
>> So when would Nutch call this static hook? In practice this does not happen 
>> before the plugin is required, but then it is too late as the 
>> MalformedURLException is thrown already.
>> And this approach cannot cover the classpath issue.
> 
> It seems Nutch would never call this static hook. That is why I patched the 
> PluginRepository class.
> 
>>>> - create a tutorial to add some arbitrary protocol (e.g. the 
>>>> foo://bar/baz url)
>>>> - modify the protocol-smb plugin to make use of the smbclient binary.
>>>>
>>>> I'd be willing to do the latter but would like to see a less clumsy 
>>>> behaviour for plugins.
>>>
>>> Great! Nutch could not exist without voluntary work. Thanks!
>>>
>>> Sorry, that integration will not be that easy. The problem has indeed 
>>> been known for a long time and should have been better tested, see also [1] 
>>> and [2] - the class org.apache.nutch.protocol.sftp.Handler (a dummy 
>>> handler) has been lost, you'll find it in the zip file attached to 
>>> NUTCH-714.
>>>
>>> However, encapsulation and lazy instantiation I would not call "clumsy 
>>> behavior", it's useful for heavy-weight plugins (e.g., parse-tika which 
>>> brings 50 MB dependencies).
>>
>> Both concepts, encapsulation and lazy instantiation are great. What I call 
>> clumsy is that the encapsulation does not work. Look at it from a user 
>> perspective of the protocol-smb plugin.
>> It comes as a (set of) jars, together with an XML descriptor. This could be 
>> nicely wrapped in a zip file and thus is one artifact that can easily be 
>> versioned and distributed.
>>
>> But as soon as I want to install it, I have to
>> 1 - put the artifact into the plugins directory
>> 2 - modify Nutch configuration files to allow smb:// urls and add 
>> the plugin to the list of loaded plugins
>> 3 - extract jcifs.jar and place it on the system classpath
>> 4 - run nutch with the correct system property
>>
>> While items 1 and 2 can be understood easily and maybe one day come with a 
>> nice management interface, items 3 and 4 require knowledge about the 
>> internals of the plugin. 
>> Where did the encapsulation go? This is where I'd like to improve, and I 
>> have an idea how that could be established. Need to test it though.
> 
> I have a solution that makes steps 3 and 4 obsolete.
> 
>> I would need the first to test modifications to the plugin system.
>> Then with the second I would create a smb plugin that would suffer 
>> other limitations than the LGPL. ;-)
> 
> So here is the solution to the first step - the modified plugin system. It is 
> available here, although I am not sure how to create the pull request...
> https://github.com/HiranChaudhuri/nutch/commit/dc9cbeb3da7ca021e2cce322482d2eaa1ec15b28
> 
> Next will be one example plugin and the mentioned protocol-smb.
> 
> Hiran
> 
