> When you look at the protocol-smb hook it comes with this static hook,
> but as it is never executed does not help.

Yes, it has to be called.

> That is understood. Although I could think of two other exercises that would 
> help:
> - create a tutorial to add some arbitrary protocol (e.g. the foo://bar/baz 
> url)
> - modify the protocol-smb plugin to make use of the smbclient binary.
>
> I'd be willing to do the latter but would like to see a less clumsy behaviour 
> for plugins.

Great! Nutch could not exist without voluntary work. Thanks!

Sorry, that integration will not be that easy. The problem has indeed been
known for a long time and should have been tested better, see also [1] and [2].
The class org.apache.nutch.protocol.sftp.Handler (a dummy handler) has been
lost; you'll find it in the zip file attached to NUTCH-714.

However, I would not call encapsulation and lazy instantiation "clumsy
behavior": they are useful for heavy-weight plugins (e.g., parse-tika, which
brings 50 MB of dependencies).

A solution should be possible. Actually, it's easy for protocol-sftp on 2.x:
 - place the dummy handler from the zip file in
    src/java/org/apache/nutch/protocol/sftp/Handler.java
   and recompile
 - register the package org.apache.nutch.protocol as handler package,
    e.g. NUTCH_OPTS=-Djava.protocol.handler.pkgs=org.apache.nutch.protocol
    (that could also be done from inside Nutch)

   NUTCH_OPTS=-Djava.protocol.handler.pkgs=org.apache.nutch.protocol \
     .../2.x/runtime/local/bin/nutch parsechecker -Dplugin.includes='protocol-sftp' sftp://..

   (no MalformedURLException)

 - if the handler class is to live in the plugin's protocol-sftp.jar, some more
   work would be necessary to make it available in the main class loader too,
   not only in the plugin's child class loader. But as it could be a small
   class without dependencies, a solution should be possible.
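
To illustrate the "from inside Nutch" idea, here is a minimal sketch. It is
not the Handler class from the NUTCH-714 zip, and it swaps in a different
registration technique: instead of the java.protocol.handler.pkgs property it
calls URL.setURLStreamHandlerFactory, which works from a single class. The
names SftpUrlDemo and DummyHandler are made up for this example, and the
factory can be set only once per JVM, so real code would have to coordinate
with anything else that sets it.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

// Hypothetical demo class, not part of Nutch.
public class SftpUrlDemo {

  // Dummy handler: lets java.net.URL parse sftp:// URLs but never opens a
  // connection itself (fetching would be left to the protocol plugin).
  static class DummyHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) throws IOException {
      throw new UnsupportedOperationException("dummy handler, no connection");
    }

    @Override
    protected int getDefaultPort() {
      return 22;
    }
  }

  private static boolean registered = false;

  // Register the dummy handler for "sftp" only. Returning null for other
  // protocols lets the JVM fall back to its normal handler lookup.
  public static synchronized void register() {
    if (registered) {
      return; // the factory may be set only once per JVM
    }
    URL.setURLStreamHandlerFactory(
        protocol -> "sftp".equals(protocol) ? new DummyHandler() : null);
    registered = true;
  }

  // True if java.net.URL accepts the given string.
  public static boolean parses(String spec) {
    try {
      new URL(spec);
      return true;
    } catch (MalformedURLException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String spec = "sftp://example.org/dir/file.txt";
    System.out.println("before registration: " + parses(spec));
    register();
    System.out.println("after registration:  " + parses(spec));
  }
}
```

The class loader question from the last item remains: whatever registers the
handler must be visible to the main class loader, not only to the plugin's.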

Thanks, I'm looking forward to seeing how you solve it,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-714
[2] http://grokbase.com/t/nutch/dev/1192bgy9fc/protocol-not-found-or-malformedurl-protocol-sftp

On 09/19/2017 06:52 AM, Hiran CHAUDHURI wrote:
>> Hi Hiran,
>>
>> ok, got it. - the problem is already given in 
>> https://issues.apache.org/jira/browse/NUTCH-427 :)
> 
> Indeed - when rereading that article it exactly describes my perception.
> 
>> In this case, you're right. The plugin system wasn't designed to manipulate 
>> Java system properties.
> 
> If it does not then setting the system property when using the crawl script 
> should have helped - but then I probably missed putting the jar into the 
> system classpath.
> 
>> But it should be possible to do it by adding a static hook which is called 
>> before instantiation.
> 
> When you look at the protocol-smb hook it comes with this static hook, but as 
> it is never executed does not help.
> 
>> The second problem would be the class loader encapsulation: the class 
>> java.net.URL is used in many places and the protocol handler 
>> (jcifs.smb.Handler) must be globally available.
> 
> True. That is where I almost assumed the Nutch configuration code would at 
> some point collect all the protocol plugins (everything registered to the 
> protocol extension point) and set the system property globally but could not 
> find it.
> 
>> But be pragmatic - protocol-smb will not make it into the "official" Nutch 
>> package because of the LGPL license [1]. 
> 
> That is understood. Although I could think of two other exercises that would 
> help:
> - create a tutorial to add some arbitrary protocol (e.g. the foo://bar/baz 
> url)
> - modify the protocol-smb plugin to make use of the smbclient binary.
> 
> I'd be willing to do the latter but would like to see a less clumsy behaviour 
> for plugins. Adding the plugin plus modifying config files should be enough 
> in my eyes.
> 
>> To make protocol-smb working for "your" Nutch package:
>>
>> 1. set the system property accordingly. If you use bin/nutch, modify it or 
>> pass it via the environment variable
>>    export NUTCH_OPTS=-Djava.protocol.handler.pkgs=jcifs
>>
>> 2. make sure that the jcifs jar is added as global dependency
>>    - add it to ivy/ivy.xml
>>    - or copy it to runtime/local/lib/  (local mode for quick testing)
>>   (or alternatively copy the jcifs/smb/Handler.java and dependencies
>>    to your source tree)
> 
> So far I used
> bin/crawl --index -D  solr.server.url=http://172.17.0.9:8983/solr/nutch -D  
> java.protocol.handler.pkgs=jcifs urls crawl
> 
> but I will try your hints. Will need a few days for this.
> 
> Hiran 
> 
