Re: Understand the code: components of ProtocolResult

Lewis John McGibbney Fri, 04 Oct 2024 13:13:21 -0700

Hi Hiran,

On 2024/10/01 21:36:37 Hiran Chaudhuri wrote:
> Looking at the interface for protocol plugin, I notice the result must
> always be ProtocolResult.


Where are you getting that from? That is not defined in the Interface...

> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/Protocol.java#L40
> 
> 
> ProtocolOutput is a combination of content and status.  

Correct. To be clear, those are Content
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/Content.java
and ProtocolStatus
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/ProtocolStatus.java

> While status can
> be guessed somehow, the content itself is built from url, base, content
> contentType, metadata and conf.

Yes OK.

> What do the different fields mean?

Firstly, we should definitely augment the Javadoc to explain this. I created 
https://issues.apache.org/jira/browse/NUTCH-3074 for that. Thanks for pointing 
this out.

To answer your question:
* url - the key for the record associated with the Content. Because Content 
implements org.apache.hadoop.io.Writable, we need this key.
* base - the base URL for relative links contained in the content. This may 
be different from the url parameter value if the request was redirected. In 
_lots_ of places across the Nutch codebase, no distinction is made between 
the values passed to 'base' and 'url'; they are treated as identical. 
TestContent shows how the values can differ, but we could certainly 
encourage more unit tests to emphasize this. See 
https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/protocol/TestContent.java
* content - the raw content encoded as a byte array with optional character set
* contentType - the detected mime/media type for the above content prior to 
encoding
* metadata - metadata about the protocol, specific to the retrieval of 
the content
* conf - if this variable is passed then MimeUtil (Apache Tika mimetype 
detection) is triggered with the goal of more accurate detection. 
See 
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L70
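
As a rough illustration of why url and base can matter separately, here is a 
minimal sketch using only java.net.URI. It does not use any Nutch classes. The 
class name, the helper method and the example URLs are all hypothetical, and 
the redirect is simply assumed to have happened.

```java
import java.net.URI;

// Hypothetical illustration only; not part of Nutch.
class BaseVsUrlSketch {

    // Resolve a (possibly relative) link found in a page against a base URL.
    static String resolve(String base, String link) {
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Assumed scenario: 'url' was requested, the server redirected to 'base'.
        String url = "http://example.com/old/page.html";
        String base = "http://example.org/new/dir/page.html";
        String link = "images/logo.png";

        // Resolving against the requested url points at the old location.
        System.out.println(resolve(url, link));
        // Resolving against base points at the location actually served.
        System.out.println(resolve(base, link));
    }
}
```

If the two values are treated as identical after a redirect, relative links 
end up resolved against the wrong location, which is the case more unit tests 
could cover.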

> Why is there a distinction between url and base?

See answer above. Some unit tests would really make this better. We should 
however augment the Javadoc (also mentioned above).

> Why is content a byte[] that forces the plugin to load all the data into
> RAM? 

I didn't write this original code, but here are some of my thoughts:
1. byte[] facilitates (optional) predictable character encoding, which can 
prevent data loss or even counter security vulnerabilities.
2. Because the data encoded in the byte[] is binary, the performance gain of 
using it in place of a String is significant.
3. Not all byte sequences are a valid String, so when one stores bytes in a 
String, the data needs to be validated. That is not required with byte[].
4. If it is in RAM then there _may_ be some benefit of cache locality 
(temporal/spatial), depending on whether we use the byte[] later from the 
memory cache. I think this is really the point which we need to investigate 
further. 
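
Point 3 can be shown with a small sketch. It uses only the JDK, the class and 
method names are hypothetical, and UTF-8 is just an assumed encoding for the 
sake of the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of Nutch.
class ByteRoundTripSketch {

    // Decode the bytes as UTF-8 text and encode the text back to bytes.
    static byte[] roundTripUtf8(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x89 (the first byte of a PNG file) is not valid as the start of a
        // UTF-8 sequence, so decoding replaces it with U+FFFD.
        byte[] raw = {(byte) 0x89, 'P', 'N', 'G'};
        byte[] back = roundTripUtf8(raw);

        // The original bytes are not recovered after going through a String.
        System.out.println(Arrays.equals(raw, back)); // prints "false"
    }
}
```

Keeping the payload as byte[] avoids this kind of silent data loss for binary 
content.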

> What should happen if a protocol plugin would hit a big resource
> (as in bigger than 2 GB)?

I think this depends. I mean, do you want to handle large artifacts like that 
or else set a limit and avoid them? This will be crawl (and crawl 
administrator) specific. For example maybe you want to process ALL archive 
files in your entire company intranet... then yes you may wish to fetch those 
types of resources. I'm sorry I don't have a better response. I don't think 
there is any surefire, static final answer for this question. It really 
depends. Nutch provides configuration to skip and truncate large content.
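
To give an idea of what truncation means at the stream level, here is a 
minimal sketch that reads at most a fixed number of bytes into memory. It is 
not how any Nutch protocol plugin is implemented. The class name, the method 
and the 'limit' parameter are hypothetical stand-ins for whatever content 
limit the crawl is configured with.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of Nutch.
class CappedReadSketch {

    // Read at most 'limit' bytes from the stream; the rest is left unread.
    static byte[] readUpTo(InputStream in, int limit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int remaining = limit;
        try {
            while (remaining > 0) {
                int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
                if (n < 0) {
                    break; // end of stream reached before the limit
                }
                out.write(buffer, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A 100-byte in-memory stream stands in for a large remote resource.
        InputStream resource = new ByteArrayInputStream(new byte[100]);
        byte[] content = readUpTo(resource, 10);
        System.out.println(content.length); // prints "10"
    }
}
```

With a cap like this the byte[] never grows beyond the limit, regardless of 
how big the remote resource is.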

> Why would a protocol plugin know about the content type?

We need to capture the content type from the response (result of making the 
request over said Protocol implementation).

> What should Metadata contain, and where will it be stored and used? 

Metadata is described as being "...A multi-valued metadata container", so 
theoretically it can contain any key, value pair you wish. _However_, as I 
detailed elsewhere in one of my responses, at the Protocol plugin level the 
Metadata would contain key, value pairs relevant to the acquisition of the 
Content over the given Protocol. A simple example could be

metadata.set(Response.CONTENT_TYPE, "text/plain; charset=UTF-16");
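
For readers unfamiliar with the idea of a multi-valued container, here is a 
small stand-in built on a Map of Lists. It is NOT the Nutch Metadata class and 
makes no claim about its API. The class, its methods and the header values are 
hypothetical, and only show the difference between replacing a value and 
adding a further one under the same key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in; this is not org.apache.nutch.metadata.Metadata.
class MultiValuedMetadataSketch {

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Replace any existing values for the name with a single value.
    void set(String name, String value) {
        List<String> list = new ArrayList<>();
        list.add(value);
        values.put(name, list);
    }

    // Append a further value for the name, keeping earlier ones.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Return all values stored for the name, or an empty list if none.
    List<String> getValues(String name) {
        return values.getOrDefault(name, Collections.emptyList());
    }

    public static void main(String[] args) {
        MultiValuedMetadataSketch metadata = new MultiValuedMetadataSketch();
        metadata.set("Content-Type", "text/plain; charset=UTF-16");
        metadata.add("Set-Cookie", "a=1");
        metadata.add("Set-Cookie", "b=2");

        System.out.println(metadata.getValues("Content-Type"));
        System.out.println(metadata.getValues("Set-Cookie"));
    }
}
```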

> I
> have not seen the indexer forwarding the metadata to solr for example.

Correct. Protocol-level Metadata is NOT forwarded to Solr; it is instead 
retained in the CrawlDB. Metadata that can be forwarded to Solr typically 
relates to the parse result of processing the Content.

> What configuration should be passed into Content?

As I detailed above, the Configuration object is used by MimeUtil. The 
properties used are:
* mime.types.file - Name of file in CLASSPATH containing filename extension and
  magic sequence to mime types mapping information. Overrides the default Tika
  config if specified. If a file can't be found then Tika's default one is used.
* mime.type.magic - Defines if the mime content type detector uses magic
  resolution. For more information see
  https://tika.apache.org/2.9.2/detection.html#Mime_Magic_Detection

> 
> 
> Some explanatory javadoc or wiki page would be much appreciated.

We can provide a patch for https://issues.apache.org/jira/browse/NUTCH-3074

I hope my responses are somewhat on point and offer some value. There are 
others on here who know much more about the Nutch codebase than me, but I 
tried to give you a detailed answer. 

lewismc
