Hi Hiran,

On 2024/10/01 21:36:37 Hiran Chaudhuri wrote:
> Looking at the interface for protocol plugin, I notice the result must
> always be ProtocolResult.

Where are you getting that from? That is not defined in the interface...

> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/Protocol.java#L40
>
> ProtocolOutput is a combination of content and status.

Correct. To be clear, see
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/Content.java
and
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/ProtocolStatus.java

> While status can
> be guessed somehow, the content itself is built from url, base, content
> contentType, metadata and conf.

Yes OK.

> What do the different fields mean?

Firstly, we should definitely augment the Javadoc to explain this. I created https://issues.apache.org/jira/browse/NUTCH-3074 for that. Thanks for pointing this out.

To answer your question:

* url - the key for the record associated with the Content. Because Content implements org.apache.hadoop.io.Writable we need this key.
* base - the base url for relative links contained in the content. This may be different from the url parameter value if the request was redirected. In _lots_ of places across the Nutch codebase, no distinction is made between the value passed to 'base' vs 'url', i.e. they are treated as identical. TestContent shows how the values can differ, but we could certainly encourage more unit tests to emphasize this. See https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/protocol/TestContent.java
* content - the raw content encoded as a byte array with optional character set.
* contentType - the detected mime/media type for the above content prior to encoding.
* metadata - this is metadata about the protocol, specific to the retrieval of the content.
* conf - if this variable is passed then MimeUtil (Apache Tika mimetype detection) is triggered with the goal of a more accurate detection capability.
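
To make the url vs base distinction from the list above a bit more concrete, here is a minimal sketch. It is NOT Nutch code: it only uses java.net.URI from the JDK, and the class name, method name and example.org URLs are all made up for illustration. It assumes a fetch where the requested url was redirected, so the content really came from a different location.

```java
import java.net.URI;

// Illustration only: plain java.net.URI, not Nutch code. All names and URLs are hypothetical.
class BaseVsUrl {

    // Resolve a relative link found in a page against the given absolute URL.
    static String resolve(String absolute, String relativeLink) {
        return URI.create(absolute).resolve(relativeLink).toString();
    }

    public static void main(String[] args) {
        // Hypothetical fetch: the requested url redirected to another location.
        String url = "https://example.org/old/index.html";          // key of the record
        String base = "https://example.org/new/section/index.html"; // where the content was served from
        String link = "img/logo.png";                               // relative link inside the content

        System.out.println("resolved against url:  " + resolve(url, link));
        System.out.println("resolved against base: " + resolve(base, link));
    }
}
```

In this made-up case the two results differ, and only the one resolved against base points at where the image would actually live. When there is no redirect, url and base are the same and the distinction disappears, which may be why so much code gets away with treating them as identical.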

For the MimeUtil detection triggered by conf, see https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L70

> Why is there a distinction between url and base?

See answer above. Some unit tests would really make this better. We should however augment the Javadoc (also mentioned above).

> Why is content a byte[] that forces the plugin to load all the data into
> RAM?

I didn't write this original code but here are some of my thoughts:

1. byte[] facilitates (optional) predictable character encoding, which can prevent data loss or even counter security vulnerabilities.
2. Because the data encoded in the byte[] is binary, the performance gain of using it in place of a String is significant.
3. Not all byte sequences are a valid String, so when one stores bytes in a String the data needs to be validated. That is not required with byte[].
4. If it is in RAM then there _may_ be some benefit from cache locality (temporal/spatial), depending on whether we use the byte[] later from the memory cache.

I think this is really the response which we need to investigate further.

> What should happen if a protocol plugin would hit a big resource
> (as in bigger than 2 GB)?

I think this depends. Do you want to handle large artifacts like that, or set a limit and avoid them? This will be crawl (and crawl administrator) specific. For example, maybe you want to process ALL archive files in your entire company intranet... then yes, you may wish to fetch those types of resources. I'm sorry I don't have a better response. I don't think there is any surefire, static final answer for this question. It really depends. Nutch provides configuration to skip and truncate large content.

> Why would a protocol plugin know about the content type?

We need to capture the content type from the response (the result of making the request over said Protocol implementation).

> What should Metadata contain, and where will it be stored and used?

Metadata is described as being "...A multi-valued metadata container", so theoretically it can contain any key, value pair you wish. _However_, as I detailed elsewhere in one of my responses, at the Protocol plugin level the Metadata would contain key, value pairs relevant to the acquisition of the Content over the given Protocol. A simple example could be

    metadata.set(Response.CONTENT_TYPE, "text/plain; charset=UTF-16");

> I
> have not seen the indexer forwarding the metadata to solr for example.

Correct. Protocol-level Metadata is NOT forwarded to Solr; it is instead retained in the CrawlDB. Metadata that can be forwarded to Solr typically relates to the parse result of processing the Content.

> What configuration should be passed into Content?

As I detailed above, the Configuration object is used by MimeUtil. The properties used are:

* mime.types.file - Name of the file in CLASSPATH containing the filename extension and magic sequence to mime type mapping information. Overrides the default Tika config if specified. If the file can't be found then Tika's default one is used.
* mime.type.magic - Defines whether the mime content type detector uses magic resolution.

For more information see https://tika.apache.org/2.9.2/detection.html#Mime_Magic_Detection

> Some explanatory javadoc or wiki page would be much appreciated.

We can provide a patch for https://issues.apache.org/jira/browse/NUTCH-3074

I hope my responses are somewhat on point and offer some value. There are others on here who know much more about the Nutch codebase than me, but I tried to give you a detailed answer.

lewismc

