Looking at the interface for protocol plugins, I notice the result must
always be a ProtocolOutput. But how should this result be constructed
correctly?

https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/Protocol.java#L40


ProtocolOutput is a combination of content and status. While I can guess
at the status, the content itself is built from url, base, content,
contentType, metadata and conf.

https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/Content.java#L67
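
To make the question concrete, here is a minimal sketch of how I currently
picture the construction. The Content and ProtocolOutput classes below are
hypothetical stand-ins that only mirror the field list above; they are not
the real Nutch classes, the conf argument is left out, and the buildOutput
helper, the "SUCCESS" status string and the url/base interpretation are my
own assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ProtocolOutputSketch {

    // Hypothetical stand-in mirroring the fields passed to Content;
    // NOT the real org.apache.nutch.protocol.Content.
    static final class Content {
        final String url;
        final String base;
        final byte[] content;
        final String contentType;
        final Map<String, String> metadata;

        Content(String url, String base, byte[] content,
                String contentType, Map<String, String> metadata) {
            this.url = url;
            this.base = base;
            this.content = content;
            this.contentType = contentType;
            this.metadata = metadata;
        }
    }

    // Hypothetical stand-in for the content + status pair.
    static final class ProtocolOutput {
        final Content content;
        final String status;

        ProtocolOutput(Content content, String status) {
            this.content = content;
            this.status = status;
        }
    }

    // Builds the output for a resource that has already been fetched.
    static ProtocolOutput buildOutput(String requestedUrl, String finalUrl,
                                      byte[] body, String contentType) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Length", Integer.toString(body.length));
        // Assumption: url is what the crawler asked for, while base is the
        // URL against which relative links in the body would be resolved.
        Content content =
            new Content(requestedUrl, finalUrl, body, contentType, metadata);
        return new ProtocolOutput(content, "SUCCESS");
    }

    public static void main(String[] args) {
        byte[] body = "<html></html>".getBytes(StandardCharsets.UTF_8);
        ProtocolOutput out = buildOutput("http://example.com/a",
                "http://example.com/b/", body, "text/html");
        System.out.println(out.content.url + " " + out.status);
    }
}
```

Even in this toy version the whole body has to be held in a byte[] before
the Content object can be created, which is what prompts the questions
below.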


What do the different fields mean?
Why is there a distinction between url and base?
Why is content a byte[], which forces the plugin to load all the data into
RAM? What should happen if a protocol plugin hits a big resource
(as in bigger than 2 GB)?
Why would a protocol plugin know about the content type?
What should Metadata contain, and where will it be stored and used? I
have not seen the indexer forward the metadata to Solr, for example.
What configuration should be passed into Content?


Some explanatory javadoc or wiki page would be much appreciated.
