Looking at the interface for protocol plugins, I notice that the result must always be a ProtocolOutput. But how should this result be constructed correctly?
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/Protocol.java#L40

ProtocolOutput is a combination of content and status. While the status can be guessed somehow, the content itself is built from url, base, content, contentType, metadata and conf.

https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/Content.java#L67

What do the different fields mean? Why is there a distinction between url and base? Why is content a byte[], which forces the plugin to load all the data into RAM? What should happen if a protocol plugin hits a big resource (say, bigger than 2 GB)? Why would a protocol plugin know about the content type? What should Metadata contain, and where will it be stored and used? I have not seen the indexer forwarding the metadata to Solr, for example. What configuration should be passed into Content?

Some explanatory Javadoc or a wiki page would be much appreciated.

