Hi Hiran,

a few comments on your questions and Lewis' exhaustive answer...


Nutch was started in 2002 as a distributed web crawler and search engine
project. Many design decisions need to be viewed from that historical perspective.

For example, the 2 GiB limit on the content size: in 2002 nobody would have thought
that this is, or might ever become, an issue. Changing it now in a
backward-compatible way wouldn't be trivial.
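
To make the 2 GiB figure concrete, here is a minimal sketch. It assumes the
limit comes from holding the fetched content in a single Java byte array,
which is indexed by int and so cannot exceed 2^31 - 1 elements. The class and
method names are made up for illustration and are not part of Nutch.

```java
public class ContentSizeLimit {
    // Hypothetical helper, not a Nutch class.
    // Java arrays are indexed by int, so no array can be longer than this.
    static final long MAX_ARRAY_LENGTH = Integer.MAX_VALUE; // 2^31 - 1

    /** Returns true if a payload of this size could fit into one byte[]. */
    static boolean fitsInByteArray(long contentLength) {
        return contentLength >= 0 && contentLength <= MAX_ARRAY_LENGTH;
    }

    public static void main(String[] args) {
        long twoGiB = 1L << 31; // 2147483648 bytes
        System.out.println(fitsInByteArray(twoGiB - 1)); // just below 2 GiB
        System.out.println(fitsInByteArray(twoGiB));     // exactly 2 GiB
    }
}
```

Lifting such a limit would mean replacing the array with a stream or a
long-indexed structure everywhere the content is read or serialized, which is
why a backward-compatible change is hard.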


Why would a protocol plugin know about the content type?

Because the HTTP protocol defines the Content-Type header. Basically,
HTTP headers are multi-valued metadata which help to process requests and
responses, or make it possible in the first place. The Nutch Content class and
the protocol implementation are definitely influenced by the layout of the
HTTP protocol itself.
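
As an illustration of "multi-valued metadata", here is a small self-contained
sketch of a header store in which one name can carry several values. It is
not the Nutch Metadata or Content class; HeaderMetadata and its methods are
hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HeaderMetadata {
    // Hypothetical class for illustration only.
    // HTTP header names are case-insensitive, hence the comparator.
    private final Map<String, List<String>> values =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    /** Adds a value without replacing values already stored under the name. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns the first value for the name, or null if there is none. */
    String getFirst(String name) {
        List<String> v = values.get(name);
        return (v == null || v.isEmpty()) ? null : v.get(0);
    }

    /** Returns all values for the name, or an empty list. */
    List<String> getAll(String name) {
        List<String> v = values.get(name);
        return v == null ? Collections.<String>emptyList() : v;
    }

    public static void main(String[] args) {
        HeaderMetadata headers = new HeaderMetadata();
        headers.add("Content-Type", "text/html; charset=UTF-8");
        headers.add("Set-Cookie", "a=1");
        headers.add("Set-Cookie", "b=2");
        System.out.println(headers.getFirst("content-type"));
        System.out.println(headers.getAll("Set-Cookie"));
    }
}
```

A protocol plugin that fills such a structure from the response naturally
sees the Content-Type value, so passing it on along with the raw bytes costs
nothing extra.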


Best,
Sebastian
